[v2] Single threaded stdio optimization

Message ID 594B92ED.6060809@arm.com
State New, archived

Commit Message

Szabolcs Nagy June 22, 2017, 9:50 a.m. UTC
  On 21/06/17 17:52, Joseph Myers wrote:
> On Wed, 21 Jun 2017, Szabolcs Nagy wrote:
> 
>> created.  A new libc abi symbol, _IO_enable_locks, does this.
>> (The abilist files will have to be updated in a separate patch).
> 
> I don't see any declarations or uses of this symbol in public headers, so 
> why does it need to be at a public symbol version instead of 
> GLIBC_PRIVATE?
> 

PATCH v3: use GLIBC_PRIVATE for _IO_enable_locks.
(should i use __libc_* naming convention for private symbols?)

Locking overhead can be significant in some stdio operations
that are common in single threaded applications.

I'd like to address it in this release if possible as it
causes performance problems for aarch64 users.  I'd prefer
this high-level approach to be reviewed first; if that does
not work, the aarch64-specific low-level approach will be taken.

This patch adds the _IO_FLAGS2_NEED_LOCK flag to indicate if
an _IO_FILE object needs to be locked and some of the stdio
functions just jump to their _unlocked variant when not.  The
flag is set on all _IO_FILE objects when the first thread is
created.  A new GLIBC_PRIVATE libc symbol, _IO_enable_locks,
was added to do this from libpthread.
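
For illustration, the fast path added to each affected function is a
single flag check followed by a tail call to the unlocked variant;
the getc change in the patch below has exactly this shape:

  int
  _IO_getc (FILE *fp)
  {
    int result;
    CHECK_FILE (fp, EOF);
    /* Single threaded: skip the lock, tail call the unlocked code.  */
    if (!_IO_need_lock (fp))
      return _IO_getc_unlocked (fp);
    _IO_acquire_lock (fp);
    result = _IO_getc_unlocked (fp);
    _IO_release_lock (fp);
    return result;
  }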

The optimization can be applied to more stdio functions;
currently it is only applied to single flag checks or single
non-wide-char standard operations.  The flag should probably
never be set for files with _IO_USER_LOCK, but that's just a
further optimization, not a correctness requirement.

The optimization is valid in a single thread because stdio
operations are not async-signal-safe (so lock state is not observable
from a signal handler) and stdio locks are recursive (so lock
state is not observable via deadlock).  The optimization is not
valid if a thread may be created while an stdio lock is taken
and thus it should be disabled if any user code may run during
an stdio operation (interposed malloc, printf hooks, etc).
This makes the optimization more complicated for some stdio
operations (e.g. printf), but those are bigger and thus less
important to optimize so this patch does not try to do that.
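
As a hedged illustration of the problematic case (hypothetical user
code, not part of the patch): a handler registered with
register_printf_function runs while printf holds the stream lock, so
if it created the first thread of the process, the flag would be set
too late for a stream whose lock is already held.  That is why
printf-like functions keep locking unconditionally:

  #include <printf.h>
  #include <pthread.h>
  #include <stdio.h>

  static void *noop (void *arg) { return NULL; }

  /* Hypothetical printf handler; printf holds stdout's lock while
     this runs.  */
  static int
  spawn_handler (FILE *fp, const struct printf_info *info,
                 const void *const *args)
  {
    pthread_t t;
    /* The first thread is created while an stdio lock is taken, so
       enabling the locks only from here on would not protect the
       already locked stream.  */
    pthread_create (&t, NULL, noop, NULL);
    pthread_join (t, NULL);
    return 0;
  }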

2017-06-22  Szabolcs Nagy  <szabolcs.nagy@arm.com>

	* libio/libio.h (_IO_FLAGS2_NEED_LOCK, _IO_need_lock): Define.
	* libio/libioP.h (_IO_enable_locks): Declare.
	* libio/Versions (_IO_enable_locks): New symbol.
	* libio/genops.c (_IO_enable_locks): Define.
	(_IO_old_init): Initialize flags2.
	* libio/feof.c (_IO_feof): Avoid locking when not needed.
	* libio/ferror.c (_IO_ferror): Likewise.
	* libio/fputc.c (fputc): Likewise.
	* libio/putc.c (_IO_putc): Likewise.
	* libio/getc.c (_IO_getc): Likewise.
	* libio/getchar.c (getchar): Likewise.
	* libio/ioungetc.c (_IO_ungetc): Likewise.
	* nptl/pthread_create.c (__pthread_create_2_1): Enable stdio locks.
	* libio/iofopncook.c (_IO_fopencookie): Enable locking for the file.
	* sysdeps/pthread/flockfile.c (__flockfile): Likewise.
  

Comments

Siddhesh Poyarekar June 29, 2017, 11:41 a.m. UTC | #1
On Thursday 22 June 2017 03:20 PM, Szabolcs Nagy wrote:
> On 21/06/17 17:52, Joseph Myers wrote:
>> On Wed, 21 Jun 2017, Szabolcs Nagy wrote:
>>
>>> created.  A new libc abi symbol, _IO_enable_locks, does this.
>>> (The abilist files will have to be updated in a separate patch).
>>
>> I don't see any declarations or uses of this symbol in public headers, so 
>> why does it need to be at a public symbol version instead of 
>> GLIBC_PRIVATE?
>>
> 
> PATCH v3: use GLIBC_PRIVATE for _IO_enable_locks.
> (should i use __libc_* naming convention for private symbols?)
> 
> Locking overhead can be significant in some stdio operations
> that are common in single threaded applications.
> 
> I'd like to address it in this release if possible as it
> causes performance problems for aarch64 users.  I'd prefer
> this high-level approach to be reviewed first; if that does
> not work, the aarch64-specific low-level approach will be taken.
> 
> This patch adds the _IO_FLAGS2_NEED_LOCK flag to indicate if
> an _IO_FILE object needs to be locked and some of the stdio
> functions just jump to their _unlocked variant when not.  The
> flag is set on all _IO_FILE objects when the first thread is
> created.  A new GLIBC_PRIVATE libc symbol, _IO_enable_locks,
> was added to do this from libpthread.
>
> The optimization can be applied to more stdio functions;
> currently it is only applied to single flag checks or single
> non-wide-char standard operations.  The flag should probably
> never be set for files with _IO_USER_LOCK, but that's just a
> further optimization, not a correctness requirement.
> 
> The optimization is valid in a single thread because stdio
> operations are not async-signal-safe (so lock state is not observable
> from a signal handler) and stdio locks are recursive (so lock
> state is not observable via deadlock).  The optimization is not
> valid if a thread may be created while an stdio lock is taken
> and thus it should be disabled if any user code may run during
> an stdio operation (interposed malloc, printf hooks, etc).
> This makes the optimization more complicated for some stdio
> operations (e.g. printf), but those are bigger and thus less
> important to optimize so this patch does not try to do that.
> 
> 2017-06-22  Szabolcs Nagy  <szabolcs.nagy@arm.com>
> 
> 	* libio/libio.h (_IO_FLAGS2_NEED_LOCK, _IO_need_lock): Define.
> 	* libio/libioP.h (_IO_enable_locks): Declare.
> 	* libio/Versions (_IO_enable_locks): New symbol.
> 	* libio/genops.c (_IO_enable_locks): Define.
> 	(_IO_old_init): Initialize flags2.
> 	* libio/feof.c (_IO_feof): Avoid locking when not needed.
> 	* libio/ferror.c (_IO_ferror): Likewise.
> 	* libio/fputc.c (fputc): Likewise.
> 	* libio/putc.c (_IO_putc): Likewise.
> 	* libio/getc.c (_IO_getc): Likewise.
> 	* libio/getchar.c (getchar): Likewise.
> 	* libio/ioungetc.c (_IO_ungetc): Likewise.
> 	* nptl/pthread_create.c (__pthread_create_2_1): Enable stdio locks.
> 	* libio/iofopncook.c (_IO_fopencookie): Enable locking for the file.
> 	* sysdeps/pthread/flockfile.c (__flockfile): Likewise.
> 
> 
> stdio.diff
> 
> 
> diff --git a/libio/Versions b/libio/Versions
> index 2a1d6e6c85..6c1624e8cd 100644
> --- a/libio/Versions
> +++ b/libio/Versions
> @@ -155,5 +155,6 @@ libc {
>    GLIBC_PRIVATE {
>      # Used by NPTL and librt
>      __libc_fatal;
> +    _IO_enable_locks;
>    }
>  }
> diff --git a/libio/feof.c b/libio/feof.c
> index 9712a81e78..8890a5f51f 100644
> --- a/libio/feof.c
> +++ b/libio/feof.c
> @@ -32,6 +32,8 @@ _IO_feof (_IO_FILE *fp)
>  {
>    int result;
>    CHECK_FILE (fp, EOF);
> +  if (!_IO_need_lock (fp))
> +    return _IO_feof_unlocked (fp);
>    _IO_flockfile (fp);
>    result = _IO_feof_unlocked (fp);
>    _IO_funlockfile (fp);
> diff --git a/libio/ferror.c b/libio/ferror.c
> index 01e3bd8e2b..d10fcd9fff 100644
> --- a/libio/ferror.c
> +++ b/libio/ferror.c
> @@ -32,6 +32,8 @@ _IO_ferror (_IO_FILE *fp)
>  {
>    int result;
>    CHECK_FILE (fp, EOF);
> +  if (!_IO_need_lock (fp))
> +    return _IO_ferror_unlocked (fp);
>    _IO_flockfile (fp);
>    result = _IO_ferror_unlocked (fp);

The patch looks OK except for the duplication (and a missing comment
below), which looks a bit clumsy.  How about something like this instead:

  bool need_lock = _IO_need_lock (fp);

  if (need_lock)
    _IO_flockfile (fp);
  result = _IO_ferror_unlocked (fp);
  if (need_lock)
    _IO_funlockfile (fp);

  return result;

You could probably make some kind of a macro out of this, I haven't
looked that hard.
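
Something along these lines, perhaps (an untested sketch; the macro
name is just a placeholder):

  /* Untested sketch: run STMT with FP's lock held only when needed.  */
  #define WITH_LOCK_IF_NEEDED(fp, stmt)		\
    do						\
      {						\
        bool __need_lock = _IO_need_lock (fp);	\
        if (__need_lock)			\
          _IO_flockfile (fp);			\
        stmt;					\
        if (__need_lock)			\
          _IO_funlockfile (fp);			\
      }						\
    while (0)

  /* Usage: WITH_LOCK_IF_NEEDED (fp, result = _IO_ferror_unlocked (fp));  */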

>    _IO_funlockfile (fp);
> diff --git a/libio/fputc.c b/libio/fputc.c
> index a7cd682fe2..b72305c06f 100644
> --- a/libio/fputc.c
> +++ b/libio/fputc.c
> @@ -32,6 +32,8 @@ fputc (int c, _IO_FILE *fp)
>  {
>    int result;
>    CHECK_FILE (fp, EOF);
> +  if (!_IO_need_lock (fp))
> +    return _IO_putc_unlocked (c, fp);
>    _IO_acquire_lock (fp);
>    result = _IO_putc_unlocked (c, fp);
>    _IO_release_lock (fp);
> diff --git a/libio/genops.c b/libio/genops.c
> index a466cfa337..132f1f1a7d 100644
> --- a/libio/genops.c
> +++ b/libio/genops.c
> @@ -570,11 +570,28 @@ _IO_init (_IO_FILE *fp, int flags)
>    _IO_init_internal (fp, flags);
>  }
>  

Add a fat comment here describing in detail what you're doing and why.

> +static int stdio_needs_locking;
> +
> +void
> +_IO_enable_locks (void)
> +{
> +  _IO_ITER i;
> +
> +  if (stdio_needs_locking)
> +    return;
> +  stdio_needs_locking = 1;
> +  for (i = _IO_iter_begin (); i != _IO_iter_end (); i = _IO_iter_next (i))
> +    _IO_iter_file (i)->_flags2 |= _IO_FLAGS2_NEED_LOCK;
> +}
> +libc_hidden_def (_IO_enable_locks)
> +
>  void
>  _IO_old_init (_IO_FILE *fp, int flags)
>  {
>    fp->_flags = _IO_MAGIC|flags;
>    fp->_flags2 = 0;
> +  if (stdio_needs_locking)
> +    fp->_flags2 |= _IO_FLAGS2_NEED_LOCK;
>    fp->_IO_buf_base = NULL;
>    fp->_IO_buf_end = NULL;
>    fp->_IO_read_base = NULL;
> diff --git a/libio/getc.c b/libio/getc.c
> index b58fd62308..fd66ef93cf 100644
> --- a/libio/getc.c
> +++ b/libio/getc.c
> @@ -34,6 +34,8 @@ _IO_getc (FILE *fp)
>  {
>    int result;
>    CHECK_FILE (fp, EOF);
> +  if (!_IO_need_lock (fp))
> +    return _IO_getc_unlocked (fp);
>    _IO_acquire_lock (fp);
>    result = _IO_getc_unlocked (fp);
>    _IO_release_lock (fp);
> diff --git a/libio/getchar.c b/libio/getchar.c
> index 5b41595d17..d79932114e 100644
> --- a/libio/getchar.c
> +++ b/libio/getchar.c
> @@ -33,6 +33,8 @@ int
>  getchar (void)
>  {
>    int result;
> +  if (!_IO_need_lock (_IO_stdin))
> +    return _IO_getc_unlocked (_IO_stdin);
>    _IO_acquire_lock (_IO_stdin);
>    result = _IO_getc_unlocked (_IO_stdin);
>    _IO_release_lock (_IO_stdin);
> diff --git a/libio/iofopncook.c b/libio/iofopncook.c
> index a08dfdaa42..982f464a68 100644
> --- a/libio/iofopncook.c
> +++ b/libio/iofopncook.c
> @@ -172,6 +172,8 @@ _IO_cookie_init (struct _IO_cookie_file *cfile, int read_write,
>    _IO_mask_flags (&cfile->__fp.file, read_write,
>  		  _IO_NO_READS+_IO_NO_WRITES+_IO_IS_APPENDING);
>  
> +  cfile->__fp.file._flags2 |= _IO_FLAGS2_NEED_LOCK;
> +
>    /* We use a negative number different from -1 for _fileno to mark that
>       this special stream is not associated with a real file, but still has
>       to be treated as such.  */
> diff --git a/libio/ioungetc.c b/libio/ioungetc.c
> index 951064fa12..917cad8abb 100644
> --- a/libio/ioungetc.c
> +++ b/libio/ioungetc.c
> @@ -33,6 +33,8 @@ _IO_ungetc (int c, _IO_FILE *fp)
>    CHECK_FILE (fp, EOF);
>    if (c == EOF)
>      return EOF;
> +  if (!_IO_need_lock (fp))
> +    return _IO_sputbackc (fp, (unsigned char) c);
>    _IO_acquire_lock (fp);
>    result = _IO_sputbackc (fp, (unsigned char) c);
>    _IO_release_lock (fp);
> diff --git a/libio/libio.h b/libio/libio.h
> index 518ffd8e44..14bcb92332 100644
> --- a/libio/libio.h
> +++ b/libio/libio.h
> @@ -119,6 +119,7 @@
>  # define _IO_FLAGS2_SCANF_STD 16
>  # define _IO_FLAGS2_NOCLOSE 32
>  # define _IO_FLAGS2_CLOEXEC 64
> +# define _IO_FLAGS2_NEED_LOCK 128
>  #endif
>  
>  /* These are "formatting flags" matching the iostream fmtflags enum values. */
> @@ -451,6 +452,9 @@ extern int _IO_ftrylockfile (_IO_FILE *) __THROW;
>  #define _IO_cleanup_region_end(_Doit) /**/
>  #endif
>  
> +#define _IO_need_lock(_fp) \
> +  (((_fp)->_flags2 & _IO_FLAGS2_NEED_LOCK) != 0)
> +
>  extern int _IO_vfscanf (_IO_FILE * __restrict, const char * __restrict,
>  			_IO_va_list, int *__restrict);
>  extern int _IO_vfprintf (_IO_FILE *__restrict, const char *__restrict,
> diff --git a/libio/libioP.h b/libio/libioP.h
> index eb93418c8d..163dfb1386 100644
> --- a/libio/libioP.h
> +++ b/libio/libioP.h
> @@ -444,6 +444,8 @@ extern void _IO_list_unlock (void) __THROW;
>  libc_hidden_proto (_IO_list_unlock)
>  extern void _IO_list_resetlock (void) __THROW;
>  libc_hidden_proto (_IO_list_resetlock)
> +extern void _IO_enable_locks (void) __THROW;
> +libc_hidden_proto (_IO_enable_locks)
>  
>  /* Default jumptable functions. */
>  
> @@ -750,7 +752,7 @@ extern _IO_off64_t _IO_seekpos_unlocked (_IO_FILE *, _IO_off64_t, int)
>  
>  #if _G_HAVE_MMAP
>  
> -# ifdef _LIBC
> +# if IS_IN (libc)
>  /* When using this code in the GNU libc we must not pollute the name space.  */
>  #  define mmap __mmap
>  #  define munmap __munmap
> diff --git a/libio/putc.c b/libio/putc.c
> index b591c5538b..6e1fdeef3a 100644
> --- a/libio/putc.c
> +++ b/libio/putc.c
> @@ -25,6 +25,8 @@ _IO_putc (int c, _IO_FILE *fp)
>  {
>    int result;
>    CHECK_FILE (fp, EOF);
> +  if (!_IO_need_lock (fp))
> +    return _IO_putc_unlocked (c, fp);
>    _IO_acquire_lock (fp);
>    result = _IO_putc_unlocked (c, fp);
>    _IO_release_lock (fp);
> diff --git a/nptl/pthread_create.c b/nptl/pthread_create.c
> index c7d1b8f413..0b3fa942f2 100644
> --- a/nptl/pthread_create.c
> +++ b/nptl/pthread_create.c
> @@ -32,6 +32,7 @@
>  #include <exit-thread.h>
>  #include <default-sched.h>
>  #include <futex-internal.h>
> +#include "libioP.h"
>  
>  #include <shlib-compat.h>
>  
> @@ -756,6 +757,9 @@ __pthread_create_2_1 (pthread_t *newthread, const pthread_attr_t *attr,
>          collect_default_sched (pd);
>      }
>  
> +  if (__glibc_unlikely (__nptl_nthreads == 1))
> +    _IO_enable_locks ();
> +
>    /* Pass the descriptor to the caller.  */
>    *newthread = (pthread_t) pd;
>  
> diff --git a/sysdeps/pthread/flockfile.c b/sysdeps/pthread/flockfile.c
> index 7fe8e99161..a8e6c28ed9 100644
> --- a/sysdeps/pthread/flockfile.c
> +++ b/sysdeps/pthread/flockfile.c
> @@ -25,6 +25,7 @@
>  void
>  __flockfile (FILE *stream)
>  {
> +  stream->_flags2 |= _IO_FLAGS2_NEED_LOCK;
>    _IO_lock_lock (*stream->_lock);
>  }
>  strong_alias (__flockfile, _IO_flockfile)
>
  
Siddhesh Poyarekar June 29, 2017, 12:01 p.m. UTC | #2
On Thursday 29 June 2017 05:11 PM, Siddhesh Poyarekar wrote:
> The patch looks OK except for the duplication (and a missing comment
> below), which looks a bit clumsy.  How about something like this instead:
> 
>   bool need_lock = _IO_need_lock (fp);
> 
>   if (need_lock)
>     _IO_flockfile (fp);
>   result = _IO_ferror_unlocked (fp);
>   if (need_lock)
>     _IO_funlockfile (fp);
> 
>   return result;
> 
> You could probably make some kind of a macro out of this, I haven't
> looked that hard.

I forgot that Torvald had commented (off-list, the thread broke somehow)
that it would be important to try and measure how much this makes
the multi-threaded case worse.

Siddhesh
  
Carlos O'Donell June 29, 2017, 12:12 p.m. UTC | #3
On 06/29/2017 08:01 AM, Siddhesh Poyarekar wrote:
> On Thursday 29 June 2017 05:11 PM, Siddhesh Poyarekar wrote:
>> The patch looks OK except for the duplication (and a missing comment
>> below), which looks a bit clumsy.  How about something like this instead:
>>
>>   bool need_lock = _IO_need_lock (fp);
>>
>>   if (need_lock)
>>     _IO_flockfile (fp);
>>   result = _IO_ferror_unlocked (fp);
>>   if (need_lock)
>>     _IO_funlockfile (fp);
>>
>>   return result;
>>
>> You could probably make some kind of a macro out of this, I haven't
>> looked that hard.
> 
> I forgot that Torvald had commented (off-list, the thread broke somehow)
> that it would be important to try and measure how much this makes
> the multi-threaded case worse.

+1

If we are going to optimize the single threaded case we need to know what
impact this has on the multi-threaded case.
  
Szabolcs Nagy June 29, 2017, 1:29 p.m. UTC | #4
On 29/06/17 12:41, Siddhesh Poyarekar wrote:
>> @@ -32,6 +32,8 @@ _IO_ferror (_IO_FILE *fp)
>>  {
>>    int result;
>>    CHECK_FILE (fp, EOF);
>> +  if (!_IO_need_lock (fp))
>> +    return _IO_ferror_unlocked (fp);
>>    _IO_flockfile (fp);
>>    result = _IO_ferror_unlocked (fp);
> 
> The patch looks OK except for the duplication (and a missing comment
> below), which looks a bit clumsy.  How about something like this instead:
> 
>   bool need_lock = _IO_need_lock (fp);
> 
>   if (need_lock)
>     _IO_flockfile (fp);
>   result = _IO_ferror_unlocked (fp);
>   if (need_lock)
>     _IO_funlockfile (fp);
> 
>   return result;
> 
> You could probably make some kind of a macro out of this, I haven't
> looked that hard.
> 

this does not do what we want:
single check + tailcall the unlocked version.

with your code gcc is more likely to do two
checks, register spills and an additional
return.

while my code does not force gcc to do the
right thing, at least it shows the intent.
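
roughly, the two source shapes (sketch only, using getc):

  /* early return: one check, the unlocked case is a tail call.  */
  if (!_IO_need_lock (fp))
    return _IO_getc_unlocked (fp);
  _IO_acquire_lock (fp);
  result = _IO_getc_unlocked (fp);
  _IO_release_lock (fp);
  return result;

  /* guarded lock: the flag is kept live across the call, tested
     twice, and both paths fall through to a shared return.  */
  bool need_lock = _IO_need_lock (fp);
  if (need_lock)
    _IO_flockfile (fp);
  result = _IO_getc_unlocked (fp);
  if (need_lock)
    _IO_funlockfile (fp);
  return result;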
  
Szabolcs Nagy June 29, 2017, 1:44 p.m. UTC | #5
On 29/06/17 13:12, Carlos O'Donell wrote:
> On 06/29/2017 08:01 AM, Siddhesh Poyarekar wrote:
>> On Thursday 29 June 2017 05:11 PM, Siddhesh Poyarekar wrote:
>>> The patch looks OK except for the duplication (and a missing comment
>>> below), which looks a bit clumsy.  How about something like this instead:
>>>
>>>   bool need_lock = _IO_need_lock (fp);
>>>
>>>   if (need_lock)
>>>     _IO_flockfile (fp);
>>>   result = _IO_ferror_unlocked (fp);
>>>   if (need_lock)
>>>     _IO_funlockfile (fp);
>>>
>>>   return result;
>>>
>>> You could probably make some kind of a macro out of this, I haven't
>>> looked that hard.
>>
>> I forgot that Torvald had commented (off-list, the thread broke somehow)
>> that it would be important to try and measure how much this makes
>> the multi-threaded case worse.
> 
> +1
> 
> If we are going to optimize the single threaded case we need to know what
> impact this has on the multi-threaded case.
> 

note that this impacts the multi-threaded case less
than the lowlevellock approach that is currently
implemented: that adds two checks, my code does
one; that loads __libc_multiple_threads twice,
mine checks a flag in fp, which is in a cache line
that is most likely already accessed by the rest
of the io code.

i cannot produce numbers immediately: the last
time i measured this, adding a dummy thread via an
LD_PRELOADed lib had more effect on the timing
of the same binary than adding a branch in the
stdio code (i'm not sure why the additional thread
affects timing so much with the current code; it
might be a cpu issue, e.g. cache aliasing caused
by slightly different layout of the loaded libs,
but it also shows that the effect of the patch
is small)
  
Siddhesh Poyarekar June 29, 2017, 2:06 p.m. UTC | #6
On Thursday 29 June 2017 06:59 PM, Szabolcs Nagy wrote:
> this does not do what we want:
> single check + tailcall the unlocked version.
> 
> with your code gcc is more likely to do two
> checks, register spills and an additional
> return.
> 
> while my code does not force gcc to do the
> right thing, at least it shows the intent.

Can you verify this both ways?  If they don't show any noticeable
difference then I prefer the way I suggested, but that is really just
for aesthetics.  Maybe also provide a __glibc_unlikely hint to aid the
compiler.

Siddhesh
  
Szabolcs Nagy June 30, 2017, 12:15 p.m. UTC | #7
On 29/06/17 13:12, Carlos O'Donell wrote:
> On 06/29/2017 08:01 AM, Siddhesh Poyarekar wrote:
>> On Thursday 29 June 2017 05:11 PM, Siddhesh Poyarekar wrote:
>>> The patch looks OK except for the duplication (and a missing comment
>>> below), which looks a bit clumsy.  How about something like this instead:
>>>
>>>   bool need_lock = _IO_need_lock (fp);
>>>
>>>   if (need_lock)
>>>     _IO_flockfile (fp);
>>>   result = _IO_ferror_unlocked (fp);
>>>   if (need_lock)
>>>     _IO_funlockfile (fp);
>>>
>>>   return result;
>>>
>>> You could probably make some kind of a macro out of this, I haven't
>>> looked that hard.
>>
>> I forgot that Torvald had commented (off-list, the thread broke somehow)
>> that it would be important to try and measure how much this makes
>> the multi-threaded case worse.
> 
> +1
> 
> If we are going to optimize the single threaded case we need to know what
> impact this has on the multi-threaded case.
> 

$orig == current
$stdio == my patch
$stdio_mt == my patch but 'needs lock' flag is set so multithread path is taken

on two particular aarch64 cpus with a particular loop count:

cpu1
time $orig/lib64/ld-2.25.90.so --library-path $orig/lib64 ./getchar
8.08user 0.04system 0:08.12elapsed 100%CPU (0avgtext+0avgdata 1472maxresident)k
0inputs+0outputs (0major+40minor)pagefaults 0swaps
time $stdio/lib64/ld-2.25.90.so --library-path $stdio/lib64 ./getchar
1.07user 0.04system 0:01.11elapsed 99%CPU (0avgtext+0avgdata 1472maxresident)k
0inputs+0outputs (0major+40minor)pagefaults 0swaps
time $stdio_mt/lib64/ld-2.25.90.so --library-path $stdio_mt/lib64 ./getchar
7.87user 0.00system 0:07.88elapsed 99%CPU (0avgtext+0avgdata 1472maxresident)k
0inputs+0outputs (0major+40minor)pagefaults 0swaps

cpu2
time $orig/lib64/ld-2.25.90.so --library-path $orig/lib64 ./getchar
8.11user 0.04system 0:08.16elapsed 99%CPU (0avgtext+0avgdata 1472maxresident)k
0inputs+0outputs (0major+40minor)pagefaults 0swaps
time $stdio/lib64/ld-2.25.90.so --library-path $stdio/lib64 ./getchar
2.29user 0.06system 0:02.35elapsed 99%CPU (0avgtext+0avgdata 1472maxresident)k
0inputs+0outputs (0major+40minor)pagefaults 0swaps
time $stdio_mt/lib64/ld-2.25.90.so --library-path $stdio_mt/lib64 ./getchar
8.12user 0.03system 0:08.16elapsed 99%CPU (0avgtext+0avgdata 1472maxresident)k
0inputs+0outputs (0major+40minor)pagefaults 0swaps

on a particular x86_64 cpu with a particular loop count:

time $orig/lib64/ld-2.25.90.so --library-path $orig/lib64 ./getchar
5.89user 0.07system 0:05.98elapsed 99%CPU (0avgtext+0avgdata 2000maxresident)k
0inputs+0outputs (0major+153minor)pagefaults 0swaps
time $stdio/lib64/ld-2.25.90.so --library-path $stdio/lib64 ./getchar
2.66user 0.06system 0:02.73elapsed 99%CPU (0avgtext+0avgdata 2032maxresident)k
0inputs+0outputs (0major+155minor)pagefaults 0swaps
time $stdio_mt/lib64/ld-2.25.90.so --library-path $stdio_mt/lib64 ./getchar
6.76user 0.08system 0:06.87elapsed 99%CPU (0avgtext+0avgdata 2032maxresident)k
0inputs+0outputs (0major+155minor)pagefaults 0swaps

in summary: on aarch64 i see no regression (in some cases stdio_mt
became faster, which can happen since the layout of the code changed);
on this particular x86 cpu stdio_mt has close to a 15% regression.

i don't believe the big regression on x86 is valid; it could
be that the benchmark just got past some cpu-internal limit
or the code got aligned differently. in fact, if i statically
link the exact same code then on the same cpu i get

time ./getchar_static-orig
6.60user 0.05system 0:06.66elapsed 99%CPU (0avgtext+0avgdata 912maxresident)k
0inputs+0outputs (0major+81minor)pagefaults 0swaps
time ./getchar_static-stdio
2.24user 0.08system 0:02.33elapsed 99%CPU (0avgtext+0avgdata 896maxresident)k
0inputs+0outputs (0major+81minor)pagefaults 0swaps
time ./getchar_static-stdio_mt
6.50user 0.06system 0:06.57elapsed 99%CPU (0avgtext+0avgdata 896maxresident)k
0inputs+0outputs (0major+81minor)pagefaults 0swaps

i.e. now it is faster with the branch! (both measurements
are repeatable)

i didn't dig into the root cause of the regression (or
why static linking is slower); i would not be too
worried about it since the common case for hot stdio
loops is in single thread processes, where even on x86
the patch gives >2x speedup.
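
for reference, ./getchar above is presumably just a tight getchar
loop over redirected input; a minimal sketch (the loop count and the
input are assumptions):

  #include <stdio.h>

  int
  main (void)
  {
    long sum = 0;
    /* "particular loop count": the exact value is an assumption.  */
    for (long i = 0; i < 400000000L; i++)
      sum += getchar ();  /* one lock/unlock per call before the patch */
    return (int) (sum & 1);  /* keep the loop from being optimized away */
  }

run as e.g. ./getchar < /dev/zero.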
  
Szabolcs Nagy June 30, 2017, 12:38 p.m. UTC | #8
On 29/06/17 15:06, Siddhesh Poyarekar wrote:
> On Thursday 29 June 2017 06:59 PM, Szabolcs Nagy wrote:
>> this does not do what we want:
>> single check + tailcall the unlocked version.
>>
>> with your code gcc is more likely to do two
>> checks, register spills and an additional
>> return.
>>
>> while my code does not force gcc to do the
>> right thing, at least it shows the intent.
> 
> Can you verify this both ways?  If they don't show any noticeable
> difference then I prefer the way I suggested, but that is really just
> for aesthetics.  Maybe also provide a __glibc_unlikely hint to aid the
> compiler.

your code layout does not work for most
io apis as they use _IO_acquire_lock/_IO_release_lock,
which create a new scope for cancel cleanup
handling.

it only works with non-cancellation point
apis (flag checks: feof, ferror) that just
flockfile/funlockfile.

i can change those two functions where it
works, but i don't think it will be any more
maintainable.
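
for reference, with cancellation handling enabled the lock macros
open and close a scope with a cleanup handler, roughly like this
(paraphrased from sysdeps/nptl/stdio-lock.h, details may differ):

  # define _IO_acquire_lock(_fp)					\
    do {							\
      _IO_FILE *_IO_acquire_lock_file				\
	__attribute__ ((cleanup (_IO_acquire_lock_fct))) = (_fp);	\
      _IO_flockfile (_IO_acquire_lock_file);

  # define _IO_release_lock(_fp) ; } while (0)

so the flockfile call cannot be guarded on its own: skipping
_IO_acquire_lock would also skip the scope whose cleanup unlocks on
cancellation, and the braces must stay balanced.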
  
Carlos O'Donell June 30, 2017, 1:16 p.m. UTC | #9
On 06/30/2017 08:15 AM, Szabolcs Nagy wrote:
> i didn't dig into the root cause of the regression (or
> why static linking is slower); i would not be too
> worried about it since the common case for hot stdio
> loops is in single thread processes, where even on x86
> the patch gives >2x speedup.

Regardless of the cause, the 15% regression on x86 MT performance
is serious, and I see no reason to push this into glibc 2.26. 
We can add it any time in 2.27, or the distros can pick it up with
a backport.

I would like to see a better characterization of the regression before
accepting this patch.

While I agree that the common case for hot stdio loops is non-MT, there
are still MT cases, and 15% is a large double-digit loss.

Have you looked at the assembly differences? What is the compiler
doing differently?

When a user asks "Why is my MT stdio 15% slower?", we owe them an
answer that is clear and concise.
  
Adhemerval Zanella June 30, 2017, 1:21 p.m. UTC | #10
On 30/06/2017 10:16, Carlos O'Donell wrote:
> On 06/30/2017 08:15 AM, Szabolcs Nagy wrote:
>> i didn't dig into the root cause of the regression (or
>> why static linking is slower); i would not be too
>> worried about it since the common case for hot stdio
>> loops is in single thread processes, where even on x86
>> the patch gives >2x speedup.
> 
> Regardless of the cause, the 15% regression on x86 MT performance
> is serious, and I see no reason to push this into glibc 2.26. 
> We can add it any time in 2.27, or the distros can pick it up with
> a backport.
> 
> I would like to see a better characterization of the regression before
> accepting this patch.
> 
> While I agree that the common case for hot stdio loops is non-MT, there
> are still MT cases, and 15% is a large double-digit loss.
> 
> Have you looked at the assembly differences? What is the compiler
> doing differently?
> 
> When a user asks "Why is my MT stdio 15% slower?", we owe them an
> answer that is clear and concise.
> 

Just a wild guess, but might some of the overhead on x86_64 be due to
the internal locks now unnecessarily checking for the single-thread
case (the __lll_lock_asm_start [1] snippet)?  Did you try using the
simple variant, without the check for __libc_multiple_threads?

[1] ./sysdeps/unix/sysv/linux/x86_64/lowlevellock.h
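
In rough C terms that snippet does something like the following (a
sketch only; the real code is inline asm in lowlevellock.h):

  /* Sketch of the x86_64 lll_lock fast path: the single-thread check
     only chooses between a LOCK-prefixed and a plain cmpxchg; the
     flag load and the extra branch happen on every lock operation.  */
  if (__libc_multiple_threads)
    {
      if (atomic_compare_and_exchange_bool_acq (futex, 1, 0))
        __lll_lock_wait_private (futex);
    }
  else
    {
      /* Same cmpxchg, without the LOCK prefix.  */
      if (*futex != 0)
        __lll_lock_wait_private (futex);
      else
        *futex = 1;
    }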
  
Szabolcs Nagy June 30, 2017, 2:39 p.m. UTC | #11
On 30/06/17 14:16, Carlos O'Donell wrote:
> On 06/30/2017 08:15 AM, Szabolcs Nagy wrote:
>> i didn't dig into the root cause of the regression (or
>> why static linking is slower); i would not be too
>> worried about it since the common case for hot stdio
>> loops is in single thread processes, where even on x86
>> the patch gives >2x speedup.
> 
> Regardless of the cause, the 15% regression on x86 MT performance
> is serious, and I see no reason to push this into glibc 2.26. 
> We can add it any time in 2.27, or the distros can pick it up with
> a backport.
> 
> I would like to see a better characterization of the regression before
> accepting this patch.
> 
> While I agree that the common case for hot stdio loops is non-MT, there
> are still MT cases, and 15% is a large double-digit loss.
> 
> Have you looked at the assembly differences? What is the compiler
> doing differently?
> 
> When a user asks "Why is my MT stdio 15% slower?", we owe them an
> answer that is clear and concise.
> 

sorry, the x86 measurement was bogus: only
the high level code thought it was multithreaded, while
the lowlevellock code thought it was single threaded, so
there were no atomic ops executed in the stdio_mt case

with atomics the orig performance is significantly
slower, so the regression relative to that is small in %.

if i create a dummy thread (to measure true mt
behaviour, same loop count):

time $orig/lib64/ld-2.25.90.so --library-path $orig/lib64 ./getchar_mt
20.31user 0.11system 0:20.47elapsed 99%CPU (0avgtext+0avgdata 2416maxresident)k
0inputs+0outputs (0major+180minor)pagefaults 0swaps
time $stdio/lib64/ld-2.25.90.so --library-path $stdio/lib64 ./getchar_mt
20.72user 0.03system 0:20.79elapsed 99%CPU (0avgtext+0avgdata 2400maxresident)k
0inputs+0outputs (0major+179minor)pagefaults 0swaps

the relative diff is 2% now, but notice that the
abs diff went down too (which points to a uarch issue
in the previous measurement).

perf stat indicates that there are 15 vs 16 branches
in the loop (so my patch indeed adds one branch,
but there are plenty of branches already); the instruction
count goes from 43 to 45 per loop iteration
(flag check + branch).

in my previous measurements, how +1 branch can
decrease the performance >10% when there are
already >10 branches (and several other insns)
is something the x86 uarchitects could explain.

in summary the patch trades 2% mt performance for
2x non-mt performance on this x86 cpu.
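
the getchar_mt binary is assumed to differ only in spawning one idle
thread before the loop, so the flag and the atomics are really in
effect; a sketch, loop count assumed:

  #include <pthread.h>
  #include <stdio.h>
  #include <unistd.h>

  static void *
  idle (void *arg)
  {
    pause ();  /* keep the thread alive but off the measured path */
    return NULL;
  }

  int
  main (void)
  {
    pthread_t t;
    /* Creating the first thread sets _IO_FLAGS2_NEED_LOCK on all
       files and switches the lowlevellock to atomic operations.  */
    pthread_create (&t, NULL, idle, NULL);
    long sum = 0;
    for (long i = 0; i < 400000000L; i++)
      sum += getchar ();
    return (int) (sum & 1);
  }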
  
Siddhesh Poyarekar June 30, 2017, 3:16 p.m. UTC | #12
On Friday 30 June 2017 06:08 PM, Szabolcs Nagy wrote:
> your code layout does not work for most
> io apis as they use _IO_acquire_lock/_IO_release_lock,
> which create a new scope for cancel cleanup
> handling.
> 
> it only works with non-cancellation point
> apis (flag checks: feof, ferror) that just
> flockfile/funlockfile.
> 
> i can change those two functions where it
> works, but i don't think it will be any more
> maintainable.

Right, that won't work.  There should be a cleaner way to do both, but I
can't think of one right now, so I'll let that go.

So the only open points right now are the comment I asked you to add
describing what/why you're doing this for stdio, and the performance
question on x86.  Carlos and the others who are monitoring this will
tell you when they're satisfied with your updated results.

Siddhesh
  
Torvald Riegel June 30, 2017, 3:18 p.m. UTC | #13
On Fri, 2017-06-30 at 13:15 +0100, Szabolcs Nagy wrote:
> on a particular x86_64 cpu with a particular loop count:
> 
> time $orig/lib64/ld-2.25.90.so --library-path $orig/lib64 ./getchar
> 5.89user 0.07system 0:05.98elapsed 99%CPU (0avgtext+0avgdata 2000maxresident)k
> 0inputs+0outputs (0major+153minor)pagefaults 0swaps
> time $stdio/lib64/ld-2.25.90.so --library-path $stdio/lib64 ./getchar
> 2.66user 0.06system 0:02.73elapsed 99%CPU (0avgtext+0avgdata 2032maxresident)k
> 0inputs+0outputs (0major+155minor)pagefaults 0swaps

What's interesting here is that your high-level optimization is faster
than doing the single-thread check in the low-level lock (x86 has it
already in the low-level lock).
  
Siddhesh Poyarekar June 30, 2017, 3:24 p.m. UTC | #14
On Friday 30 June 2017 08:48 PM, Torvald Riegel wrote:
> What's interesting here is that your high-level optimization is faster
> than doing the single-thread check in the low-level lock (x86 has it
> already in the low-level lock).

Cache locality might be a factor - the fp struct may already be in cache
as opposed to the TLS block from which one gets the value of
multiple_threads.

Siddhesh
  
Ramana Radhakrishnan June 30, 2017, 3:32 p.m. UTC | #15
On Fri, Jun 30, 2017 at 4:18 PM, Torvald Riegel <triegel@redhat.com> wrote:
> On Fri, 2017-06-30 at 13:15 +0100, Szabolcs Nagy wrote:
>> on a particular x86_64 cpu with a particular loop count:
>>
>> time $orig/lib64/ld-2.25.90.so --library-path $orig/lib64 ./getchar
>> 5.89user 0.07system 0:05.98elapsed 99%CPU (0avgtext+0avgdata 2000maxresident)k
>> 0inputs+0outputs (0major+153minor)pagefaults 0swaps
>> time $stdio/lib64/ld-2.25.90.so --library-path $stdio/lib64 ./getchar
>> 2.66user 0.06system 0:02.73elapsed 99%CPU (0avgtext+0avgdata 2032maxresident)k
>> 0inputs+0outputs (0major+155minor)pagefaults 0swaps
>
> What's interesting here is that your high-level optimization is faster
> than doing the single-thread check in the low-level lock (x86 has it
> already in the low-level lock).

Isn't that the expected behaviour here?

Ramana
  
Torvald Riegel June 30, 2017, 3:34 p.m. UTC | #16
On Fri, 2017-06-30 at 20:54 +0530, Siddhesh Poyarekar wrote:
> On Friday 30 June 2017 08:48 PM, Torvald Riegel wrote:
> > What's interesting here is that your high-level optimization is faster
> > than doing the single-thread check in the low-level lock (x86 has it
> > already in the low-level lock).
> 
> Cache locality might be a factor - the fp struct may already be in cache
> as opposed to the TLS block from which one gets the value of
> multiple_threads.

And the check can happen earlier in the code, and at a higher level.
  
Szabolcs Nagy June 30, 2017, 3:51 p.m. UTC | #17
On 30/06/17 16:18, Torvald Riegel wrote:
> On Fri, 2017-06-30 at 13:15 +0100, Szabolcs Nagy wrote:
>> on a particular x86_64 cpu with a particular loop count:
>>
>> time $orig/lib64/ld-2.25.90.so --library-path $orig/lib64 ./getchar
>> 5.89user 0.07system 0:05.98elapsed 99%CPU (0avgtext+0avgdata 2000maxresident)k
>> 0inputs+0outputs (0major+153minor)pagefaults 0swaps
>> time $stdio/lib64/ld-2.25.90.so --library-path $stdio/lib64 ./getchar
>> 2.66user 0.06system 0:02.73elapsed 99%CPU (0avgtext+0avgdata 2032maxresident)k
>> 0inputs+0outputs (0major+155minor)pagefaults 0swaps
> 
> What's interesting here is that your high-level optimization is faster
> than doing the single-thread check in the low-level lock (x86 has it
> already in the low-level lock).
> 

musl has implemented stdio this way since 2011, which
is the reason for the 2x in getc/putc (i386):
http://www.etalabs.net/compare_libcs.html
we knew this already, but pushing a high level opt
in glibc is harder than doing target specific lll opt.

musl does not have to worry about fopencookie, libc
internal malloc interposition, printf hooks, magic IO
flags, or target specific hacks, so the optimization is
more obviously correct there, but now that the patch
only omits locks in specific functions the correctness
should be easier to verify.
  
Carlos O'Donell June 30, 2017, 4:07 p.m. UTC | #18
On 06/30/2017 10:39 AM, Szabolcs Nagy wrote:
> On 30/06/17 14:16, Carlos O'Donell wrote:
>> On 06/30/2017 08:15 AM, Szabolcs Nagy wrote:
>>> i didn't dig into the root cause of the regression (or
>>> why static linking is slower); i would not be too
>>> worried about it since the common case for hot stdio
>>> loops is in single thread processes, where even on x86
>>> the patch gives >2x speedup.
>>
>> Regardless of the cause, the 15% regression on x86 MT performance
>> is serious, and I see no reason to push this into glibc 2.26. 
>> We can add it any time in 2.27, or the distros can pick it up with
>> a backport.
>>
>> I would like to see a better characterization of the regression before
>> accepting this patch.
>>
>> While I agree that the common case for hot stdio loops is non-MT, there
>> are still MT cases, and 15% is a large double-digit loss.
>>
>> Have you looked at the assembly differences? What is the compiler
>> doing differently?
>>
>> When a user asks "Why is my MT stdio 15% slower?", we owe them an
>> answer that is clear and concise.
>>
> 
> sorry, the x86 measurement was bogus: only
> the high level code thought it was multithreaded, while
> the lowlevellock code thought it was single threaded, so
> there were no atomic ops executed in the stdio_mt case

OK.

> with atomics the orig performance is significantly
> slower, so the regression relative to that is small in %.
> 
> if i create a dummy thread (to measure true mt
> behaviour, same loop count):
> 
> time $orig/lib64/ld-2.25.90.so --library-path $orig/lib64 ./getchar_mt
> 20.31user 0.11system 0:20.47elapsed 99%CPU (0avgtext+0avgdata 2416maxresident)k
> 0inputs+0outputs (0major+180minor)pagefaults 0swaps
> time $stdio/lib64/ld-2.25.90.so --library-path $stdio/lib64 ./getchar_mt
> 20.72user 0.03system 0:20.79elapsed 99%CPU (0avgtext+0avgdata 2400maxresident)k
> 0inputs+0outputs (0major+179minor)pagefaults 0swaps
> 
> the relative diff is 2% now, but notice that the
> abs diff went down too (which points to a uarch issue
> in the previous measurement).

OK. This is much better.

> perf stat indicates that there are 15 vs 16 branches
> in the loop (so my patch indeed adds one branch,
> but there are plenty of branches already); the instruction
> count goes from 43 to 45 per loop iteration
> (flag check + branch).
> 
> in my previous measurements, how +1 branch can
> decrease the performance >10% when there are
> already >10 branches (and several other insns)
> is something the x86 uarchitects could explain.
> 
> in summary the patch trades 2% mt performance for
> 2x non-mt performance on this x86 cpu.
 
Excellent, this is exactly the analysis I was looking for, and this kind
of result is something that can make sense to our users.

I'm OK with the patch for 2.26.
  

Patch

diff --git a/libio/Versions b/libio/Versions
index 2a1d6e6c85..6c1624e8cd 100644
--- a/libio/Versions
+++ b/libio/Versions
@@ -155,5 +155,6 @@  libc {
   GLIBC_PRIVATE {
     # Used by NPTL and librt
     __libc_fatal;
+    _IO_enable_locks;
   }
 }
diff --git a/libio/feof.c b/libio/feof.c
index 9712a81e78..8890a5f51f 100644
--- a/libio/feof.c
+++ b/libio/feof.c
@@ -32,6 +32,8 @@  _IO_feof (_IO_FILE *fp)
 {
   int result;
   CHECK_FILE (fp, EOF);
+  if (!_IO_need_lock (fp))
+    return _IO_feof_unlocked (fp);
   _IO_flockfile (fp);
   result = _IO_feof_unlocked (fp);
   _IO_funlockfile (fp);
diff --git a/libio/ferror.c b/libio/ferror.c
index 01e3bd8e2b..d10fcd9fff 100644
--- a/libio/ferror.c
+++ b/libio/ferror.c
@@ -32,6 +32,8 @@  _IO_ferror (_IO_FILE *fp)
 {
   int result;
   CHECK_FILE (fp, EOF);
+  if (!_IO_need_lock (fp))
+    return _IO_ferror_unlocked (fp);
   _IO_flockfile (fp);
   result = _IO_ferror_unlocked (fp);
   _IO_funlockfile (fp);
diff --git a/libio/fputc.c b/libio/fputc.c
index a7cd682fe2..b72305c06f 100644
--- a/libio/fputc.c
+++ b/libio/fputc.c
@@ -32,6 +32,8 @@  fputc (int c, _IO_FILE *fp)
 {
   int result;
   CHECK_FILE (fp, EOF);
+  if (!_IO_need_lock (fp))
+    return _IO_putc_unlocked (c, fp);
   _IO_acquire_lock (fp);
   result = _IO_putc_unlocked (c, fp);
   _IO_release_lock (fp);
diff --git a/libio/genops.c b/libio/genops.c
index a466cfa337..132f1f1a7d 100644
--- a/libio/genops.c
+++ b/libio/genops.c
@@ -570,11 +570,28 @@  _IO_init (_IO_FILE *fp, int flags)
   _IO_init_internal (fp, flags);
 }
 
+static int stdio_needs_locking;
+
+void
+_IO_enable_locks (void)
+{
+  _IO_ITER i;
+
+  if (stdio_needs_locking)
+    return;
+  stdio_needs_locking = 1;
+  for (i = _IO_iter_begin (); i != _IO_iter_end (); i = _IO_iter_next (i))
+    _IO_iter_file (i)->_flags2 |= _IO_FLAGS2_NEED_LOCK;
+}
+libc_hidden_def (_IO_enable_locks)
+
 void
 _IO_old_init (_IO_FILE *fp, int flags)
 {
   fp->_flags = _IO_MAGIC|flags;
   fp->_flags2 = 0;
+  if (stdio_needs_locking)
+    fp->_flags2 |= _IO_FLAGS2_NEED_LOCK;
   fp->_IO_buf_base = NULL;
   fp->_IO_buf_end = NULL;
   fp->_IO_read_base = NULL;
diff --git a/libio/getc.c b/libio/getc.c
index b58fd62308..fd66ef93cf 100644
--- a/libio/getc.c
+++ b/libio/getc.c
@@ -34,6 +34,8 @@  _IO_getc (FILE *fp)
 {
   int result;
   CHECK_FILE (fp, EOF);
+  if (!_IO_need_lock (fp))
+    return _IO_getc_unlocked (fp);
   _IO_acquire_lock (fp);
   result = _IO_getc_unlocked (fp);
   _IO_release_lock (fp);
diff --git a/libio/getchar.c b/libio/getchar.c
index 5b41595d17..d79932114e 100644
--- a/libio/getchar.c
+++ b/libio/getchar.c
@@ -33,6 +33,8 @@  int
 getchar (void)
 {
   int result;
+  if (!_IO_need_lock (_IO_stdin))
+    return _IO_getc_unlocked (_IO_stdin);
   _IO_acquire_lock (_IO_stdin);
   result = _IO_getc_unlocked (_IO_stdin);
   _IO_release_lock (_IO_stdin);
diff --git a/libio/iofopncook.c b/libio/iofopncook.c
index a08dfdaa42..982f464a68 100644
--- a/libio/iofopncook.c
+++ b/libio/iofopncook.c
@@ -172,6 +172,8 @@  _IO_cookie_init (struct _IO_cookie_file *cfile, int read_write,
   _IO_mask_flags (&cfile->__fp.file, read_write,
 		  _IO_NO_READS+_IO_NO_WRITES+_IO_IS_APPENDING);
 
+  cfile->__fp.file._flags2 |= _IO_FLAGS2_NEED_LOCK;
+
   /* We use a negative number different from -1 for _fileno to mark that
      this special stream is not associated with a real file, but still has
      to be treated as such.  */
diff --git a/libio/ioungetc.c b/libio/ioungetc.c
index 951064fa12..917cad8abb 100644
--- a/libio/ioungetc.c
+++ b/libio/ioungetc.c
@@ -33,6 +33,8 @@  _IO_ungetc (int c, _IO_FILE *fp)
   CHECK_FILE (fp, EOF);
   if (c == EOF)
     return EOF;
+  if (!_IO_need_lock (fp))
+    return _IO_sputbackc (fp, (unsigned char) c);
   _IO_acquire_lock (fp);
   result = _IO_sputbackc (fp, (unsigned char) c);
   _IO_release_lock (fp);
diff --git a/libio/libio.h b/libio/libio.h
index 518ffd8e44..14bcb92332 100644
--- a/libio/libio.h
+++ b/libio/libio.h
@@ -119,6 +119,7 @@ 
 # define _IO_FLAGS2_SCANF_STD 16
 # define _IO_FLAGS2_NOCLOSE 32
 # define _IO_FLAGS2_CLOEXEC 64
+# define _IO_FLAGS2_NEED_LOCK 128
 #endif
 
 /* These are "formatting flags" matching the iostream fmtflags enum values. */
@@ -451,6 +452,9 @@  extern int _IO_ftrylockfile (_IO_FILE *) __THROW;
 #define _IO_cleanup_region_end(_Doit) /**/
 #endif
 
+#define _IO_need_lock(_fp) \
+  (((_fp)->_flags2 & _IO_FLAGS2_NEED_LOCK) != 0)
+
 extern int _IO_vfscanf (_IO_FILE * __restrict, const char * __restrict,
 			_IO_va_list, int *__restrict);
 extern int _IO_vfprintf (_IO_FILE *__restrict, const char *__restrict,
diff --git a/libio/libioP.h b/libio/libioP.h
index eb93418c8d..163dfb1386 100644
--- a/libio/libioP.h
+++ b/libio/libioP.h
@@ -444,6 +444,8 @@  extern void _IO_list_unlock (void) __THROW;
 libc_hidden_proto (_IO_list_unlock)
 extern void _IO_list_resetlock (void) __THROW;
 libc_hidden_proto (_IO_list_resetlock)
+extern void _IO_enable_locks (void) __THROW;
+libc_hidden_proto (_IO_enable_locks)
 
 /* Default jumptable functions. */
 
@@ -750,7 +752,7 @@  extern _IO_off64_t _IO_seekpos_unlocked (_IO_FILE *, _IO_off64_t, int)
 
 #if _G_HAVE_MMAP
 
-# ifdef _LIBC
+# if IS_IN (libc)
 /* When using this code in the GNU libc we must not pollute the name space.  */
 #  define mmap __mmap
 #  define munmap __munmap
diff --git a/libio/putc.c b/libio/putc.c
index b591c5538b..6e1fdeef3a 100644
--- a/libio/putc.c
+++ b/libio/putc.c
@@ -25,6 +25,8 @@  _IO_putc (int c, _IO_FILE *fp)
 {
   int result;
   CHECK_FILE (fp, EOF);
+  if (!_IO_need_lock (fp))
+    return _IO_putc_unlocked (c, fp);
   _IO_acquire_lock (fp);
   result = _IO_putc_unlocked (c, fp);
   _IO_release_lock (fp);
diff --git a/nptl/pthread_create.c b/nptl/pthread_create.c
index c7d1b8f413..0b3fa942f2 100644
--- a/nptl/pthread_create.c
+++ b/nptl/pthread_create.c
@@ -32,6 +32,7 @@ 
 #include <exit-thread.h>
 #include <default-sched.h>
 #include <futex-internal.h>
+#include "libioP.h"
 
 #include <shlib-compat.h>
 
@@ -756,6 +757,9 @@  __pthread_create_2_1 (pthread_t *newthread, const pthread_attr_t *attr,
         collect_default_sched (pd);
     }
 
+  if (__glibc_unlikely (__nptl_nthreads == 1))
+    _IO_enable_locks ();
+
   /* Pass the descriptor to the caller.  */
   *newthread = (pthread_t) pd;
 
diff --git a/sysdeps/pthread/flockfile.c b/sysdeps/pthread/flockfile.c
index 7fe8e99161..a8e6c28ed9 100644
--- a/sysdeps/pthread/flockfile.c
+++ b/sysdeps/pthread/flockfile.c
@@ -25,6 +25,7 @@ 
 void
 __flockfile (FILE *stream)
 {
+  stream->_flags2 |= _IO_FLAGS2_NEED_LOCK;
   _IO_lock_lock (*stream->_lock);
 }
 strong_alias (__flockfile, _IO_flockfile)