[RFC] Single threaded stdio optimization

Message ID 591B1DE9.90805@arm.com
State New, archived
Headers

Commit Message

Szabolcs Nagy May 16, 2017, 3:42 p.m. UTC
  Locking overhead can be significant in some stdio operations
which are common in single threaded applications so it makes
sense to optimize that case.

there are two basic approaches:
1) avoid the use of expensive atomics in the lowlevellock code
while there is only one thread (x86 already implements this).
2) jump to the _unlocked variant of an stdio function in case
the file needs no locking (the process will be single threaded
until the end of the stdio function).

this patch is an incomplete implementation of 2) which is target
independent and improves stdio performance more (however it does
not affect lock usage in malloc for example).

issues not tackled:

- files with _IO_USER_LOCK flag set could use the same mechanism
which would mean less checks.
- malloc interposition is not handled yet. whenever (non-as-safe)
user code may run between flockfile and funlockfile the optimization
must be disabled in case a thread is created, i just don't know
what's the best way to detect malloc interposition at libc startup.
- i used a new libc symbol (_IO_enable_locks) that pthread_create
can call to enable the stdio locks, there might be a better way.
(abilists are not updated yet).
- stdio has various configurations that i did not test (non-linux
or non-multi-threaded setups).

my question is if this approach is acceptable or if the target
specific lowlevellock optimization (like x86 does it) preferred.

(a further performance improvement is possible if the flag check
is inlined in user code, but then the flag bit becomes abi.)

2017-05-16  Szabolcs Nagy  <szabolcs.nagy@arm.com>

	* libio/libio.h (_IO_FLAGS2_NEED_LOCK, _IO_need_lock): Define.
	* libio/libioP.h (_IO_enable_locks): Declare.
	* libio/Versions (_IO_enable_locks): New symbol.
	* libio/genops.c (_IO_enable_locks): Define.
	(_IO_old_init): Initialize flags2.
	* include/libio.h (_IO_flockfile): Avoid locking when possible.
	(_IO_funlockfile): Likewise.
	* libio/fputc.c (fputc): Likewise.
	* libio/putc.c (_IO_putc): Likewise.
	* libio/getc.c (_IO_getc): Likewise.
	* libio/getchar.c (getchar): Likewise
	* nptl/pthread_create.c (__pthread_create_2_1): Enable stdio locks.
	* stdio-common/reg-printf.c (__register_printf_specifier): Likewise.
	* stdio-common/reg-type.c (__register_printf_type): Likwise.
	* libio/iofopncook.c (_IO_fopencookie): Enable locking for the file.
	* sysdeps/pthread/flockfile.c (__flockfile): Likewise.
  

Comments

Torvald Riegel May 30, 2017, 7:01 a.m. UTC | #1
On Tue, 2017-05-16 at 16:42 +0100, Szabolcs Nagy wrote:
> Locking overhead can be significant in some stdio operations
> which are common in single threaded applications so it makes
> sense to optimize that case.
> 
> there are two basic approaches:
> 1) avoid the use of expensive atomics in the lowlevellock code
> while there is only one thread (x86 already implements this).
> 2) jump to the _unlocked variant of an stdio function in case
> the file needs no locking (the process will be single threaded
> until the end of the stdio function).
> 
> this patch is an incomplete implementation of 2) which is target
> independent and improves stdio performance more (however it does
> not affect lock usage in malloc for example).
> 
> issues not tackled:
> 
> - files with _IO_USER_LOCK flag set could use the same mechanism
> which would mean less checks.
> - malloc interposition is not handled yet. whenever (non-as-safe)
> user code may run between flockfile and funlockfile the optimization
> must be disabled in case a thread is created, i just don't know
> what's the best way to detect malloc interposition at libc startup.
> - i used a new libc symbol (_IO_enable_locks) that pthread_create
> can call to enable the stdio locks, there might be a better way.
> (abilists are not updated yet).
> - stdio has various configurations that i did not test (non-linux
> or non-multi-threaded setups).
> 
> my question is if this approach is acceptable or if the target
> specific lowlevellock optimization (like x86 does it) preferred.

I think approach 2) is much better.  If we want to avoid synchronization
costs, we should try to do it at the highest level so that we can also
optimize the users of synchronization (eg, just for illustrative
purposes, avoiding both a lock and unlock call just with one branch).

Once we've implemented that approach for all clients of lowlevellock
where this might matter, I'd also remove the x86-specific optimization
(at which point we could remove the x86-specific lll implementation too,
I believe, and use the generic one).  For example, do something similar
in malloc (or do something elsethat  makes it unlikely that
synchronization is used if that's not necessary). In contrast, in
pthread_mutex*, IMO we can assume that the caller wouldn't use these
unless the program really is multi-threaded; I know others have voiced
other opinions, but even if don't want to assume that optimizing for
single-threaded use in pthread_mutex_* would still be better than doing
it in the low-level lock.
  

Patch

diff --git a/include/libio.h b/include/libio.h
index d2fa796758..6d23f5f337 100644
--- a/include/libio.h
+++ b/include/libio.h
@@ -30,14 +30,14 @@  libc_hidden_proto (_IO_vfscanf)
 # define _IO_peekc(_fp) _IO_peekc_locked (_fp)
 # if _IO_lock_inexpensive
 #  define _IO_flockfile(_fp) \
-  if (((_fp)->_flags & _IO_USER_LOCK) == 0) _IO_lock_lock (*(_fp)->_lock)
+  if (_IO_need_lock(_fp) && ((_fp)->_flags & _IO_USER_LOCK) == 0) _IO_lock_lock (*(_fp)->_lock)
 #  define _IO_funlockfile(_fp) \
-  if (((_fp)->_flags & _IO_USER_LOCK) == 0) _IO_lock_unlock (*(_fp)->_lock)
+  if (_IO_need_lock(_fp) && ((_fp)->_flags & _IO_USER_LOCK) == 0) _IO_lock_unlock (*(_fp)->_lock)
 # else
 #  define _IO_flockfile(_fp) \
-  if (((_fp)->_flags & _IO_USER_LOCK) == 0) _IO_flockfile (_fp)
+  if (_IO_need_lock(_fp) && ((_fp)->_flags & _IO_USER_LOCK) == 0) _IO_flockfile (_fp)
 #  define _IO_funlockfile(_fp) \
-  if (((_fp)->_flags & _IO_USER_LOCK) == 0) _IO_funlockfile (_fp)
+  if (_IO_need_lock(_fp) && ((_fp)->_flags & _IO_USER_LOCK) == 0) _IO_funlockfile (_fp)
 # endif
 #endif /* _IO_MTSAFE_IO */
 
diff --git a/libio/Versions b/libio/Versions
index 2a1d6e6c85..a22285ab53 100644
--- a/libio/Versions
+++ b/libio/Versions
@@ -152,6 +152,9 @@  libc {
     # f*
     fmemopen;
   }
+  GLIBC_2.25 {
+    _IO_enable_locks;
+  }
   GLIBC_PRIVATE {
     # Used by NPTL and librt
     __libc_fatal;
diff --git a/libio/fputc.c b/libio/fputc.c
index a7cd682fe2..b72305c06f 100644
--- a/libio/fputc.c
+++ b/libio/fputc.c
@@ -32,6 +32,8 @@  fputc (int c, _IO_FILE *fp)
 {
   int result;
   CHECK_FILE (fp, EOF);
+  if (!_IO_need_lock (fp))
+    return _IO_putc_unlocked (c, fp);
   _IO_acquire_lock (fp);
   result = _IO_putc_unlocked (c, fp);
   _IO_release_lock (fp);
diff --git a/libio/genops.c b/libio/genops.c
index a466cfa337..0cbf7e2eaf 100644
--- a/libio/genops.c
+++ b/libio/genops.c
@@ -570,11 +570,28 @@  _IO_init (_IO_FILE *fp, int flags)
   _IO_init_internal (fp, flags);
 }
 
+static int stdio_needs_locking;
+
+void
+_IO_enable_locks (void)
+{
+  _IO_ITER i;
+
+  if (stdio_needs_locking)
+    return;
+  stdio_needs_locking = 1;
+  for (i = _IO_iter_begin(); i != _IO_iter_end(); i = _IO_iter_next(i))
+    _IO_iter_file(i)->_flags2 |= _IO_FLAGS2_NEED_LOCK;
+}
+libc_hidden_def (_IO_enable_locks)
+
 void
 _IO_old_init (_IO_FILE *fp, int flags)
 {
   fp->_flags = _IO_MAGIC|flags;
   fp->_flags2 = 0;
+  if (stdio_needs_locking)
+    fp->_flags2 |= _IO_FLAGS2_NEED_LOCK;
   fp->_IO_buf_base = NULL;
   fp->_IO_buf_end = NULL;
   fp->_IO_read_base = NULL;
diff --git a/libio/getc.c b/libio/getc.c
index b58fd62308..fd66ef93cf 100644
--- a/libio/getc.c
+++ b/libio/getc.c
@@ -34,6 +34,8 @@  _IO_getc (FILE *fp)
 {
   int result;
   CHECK_FILE (fp, EOF);
+  if (!_IO_need_lock (fp))
+    return _IO_getc_unlocked (fp);
   _IO_acquire_lock (fp);
   result = _IO_getc_unlocked (fp);
   _IO_release_lock (fp);
diff --git a/libio/getchar.c b/libio/getchar.c
index 5b41595d17..d79932114e 100644
--- a/libio/getchar.c
+++ b/libio/getchar.c
@@ -33,6 +33,8 @@  int
 getchar (void)
 {
   int result;
+  if (!_IO_need_lock (_IO_stdin))
+    return _IO_getc_unlocked (_IO_stdin);
   _IO_acquire_lock (_IO_stdin);
   result = _IO_getc_unlocked (_IO_stdin);
   _IO_release_lock (_IO_stdin);
diff --git a/libio/iofopncook.c b/libio/iofopncook.c
index a08dfdaa42..982f464a68 100644
--- a/libio/iofopncook.c
+++ b/libio/iofopncook.c
@@ -172,6 +172,8 @@  _IO_cookie_init (struct _IO_cookie_file *cfile, int read_write,
   _IO_mask_flags (&cfile->__fp.file, read_write,
 		  _IO_NO_READS+_IO_NO_WRITES+_IO_IS_APPENDING);
 
+  cfile->__fp.file._flags2 |= _IO_FLAGS2_NEED_LOCK;
+
   /* We use a negative number different from -1 for _fileno to mark that
      this special stream is not associated with a real file, but still has
      to be treated as such.  */
diff --git a/libio/libio.h b/libio/libio.h
index 518ffd8e44..14bcb92332 100644
--- a/libio/libio.h
+++ b/libio/libio.h
@@ -119,6 +119,7 @@ 
 # define _IO_FLAGS2_SCANF_STD 16
 # define _IO_FLAGS2_NOCLOSE 32
 # define _IO_FLAGS2_CLOEXEC 64
+# define _IO_FLAGS2_NEED_LOCK 128
 #endif
 
 /* These are "formatting flags" matching the iostream fmtflags enum values. */
@@ -451,6 +452,9 @@  extern int _IO_ftrylockfile (_IO_FILE *) __THROW;
 #define _IO_cleanup_region_end(_Doit) /**/
 #endif
 
+#define _IO_need_lock(_fp) \
+  (((_fp)->_flags2 & _IO_FLAGS2_NEED_LOCK) != 0)
+
 extern int _IO_vfscanf (_IO_FILE * __restrict, const char * __restrict,
 			_IO_va_list, int *__restrict);
 extern int _IO_vfprintf (_IO_FILE *__restrict, const char *__restrict,
diff --git a/libio/libioP.h b/libio/libioP.h
index eb93418c8d..1832b44cc7 100644
--- a/libio/libioP.h
+++ b/libio/libioP.h
@@ -444,6 +444,8 @@  extern void _IO_list_unlock (void) __THROW;
 libc_hidden_proto (_IO_list_unlock)
 extern void _IO_list_resetlock (void) __THROW;
 libc_hidden_proto (_IO_list_resetlock)
+extern void _IO_enable_locks (void) __THROW;
+libc_hidden_proto (_IO_enable_locks)
 
 /* Default jumptable functions. */
 
diff --git a/libio/putc.c b/libio/putc.c
index b591c5538b..6e1fdeef3a 100644
--- a/libio/putc.c
+++ b/libio/putc.c
@@ -25,6 +25,8 @@  _IO_putc (int c, _IO_FILE *fp)
 {
   int result;
   CHECK_FILE (fp, EOF);
+  if (!_IO_need_lock (fp))
+    return _IO_putc_unlocked (c, fp);
   _IO_acquire_lock (fp);
   result = _IO_putc_unlocked (c, fp);
   _IO_release_lock (fp);
diff --git a/nptl/pthread_create.c b/nptl/pthread_create.c
index c7d1b8f413..7ffecd9c3e 100644
--- a/nptl/pthread_create.c
+++ b/nptl/pthread_create.c
@@ -33,6 +33,11 @@ 
 #include <default-sched.h>
 #include <futex-internal.h>
 
+/* hack */
+#include "libioP.h"
+#undef mmap
+#undef munmap
+
 #include <shlib-compat.h>
 
 #include <stap-probe.h>
@@ -756,6 +761,9 @@  __pthread_create_2_1 (pthread_t *newthread, const pthread_attr_t *attr,
         collect_default_sched (pd);
     }
 
+  if (__glibc_unlikely (__nptl_nthreads == 1))
+    _IO_enable_locks ();
+
   /* Pass the descriptor to the caller.  */
   *newthread = (pthread_t) pd;
 
diff --git a/stdio-common/reg-printf.c b/stdio-common/reg-printf.c
index cbb9307795..92ab2b2d07 100644
--- a/stdio-common/reg-printf.c
+++ b/stdio-common/reg-printf.c
@@ -21,6 +21,7 @@ 
 #include <stddef.h>
 #include <stdlib.h>
 #include <libc-lock.h>
+#include "libioP.h"
 
 
 /* Array of functions indexed by format character.  */
@@ -67,6 +68,8 @@  __register_printf_specifier (int spec, printf_function converter,
   __printf_function_table[spec] = converter;
   __printf_arginfo_table[spec] = arginfo;
 
+  _IO_enable_locks ();
+
  out:
   __libc_lock_unlock (lock);
 
diff --git a/stdio-common/reg-type.c b/stdio-common/reg-type.c
index cc8952754a..083233d1b0 100644
--- a/stdio-common/reg-type.c
+++ b/stdio-common/reg-type.c
@@ -19,6 +19,7 @@ 
 #include <printf.h>
 #include <stdlib.h>
 #include <libc-lock.h>
+#include "libioP.h"
 
 
 /* Array of functions indexed by format character.  */
@@ -53,6 +54,8 @@  __register_printf_type (printf_va_arg_function fct)
       __printf_va_arg_table[result - PA_LAST] = fct;
     }
 
+  _IO_enable_locks ();
+
  out:
   __libc_lock_unlock (lock);
 
diff --git a/sysdeps/pthread/flockfile.c b/sysdeps/pthread/flockfile.c
index 7fe8e99161..a8e6c28ed9 100644
--- a/sysdeps/pthread/flockfile.c
+++ b/sysdeps/pthread/flockfile.c
@@ -25,6 +25,7 @@ 
 void
 __flockfile (FILE *stream)
 {
+  stream->_flags2 |= _IO_FLAGS2_NEED_LOCK;
   _IO_lock_lock (*stream->_lock);
 }
 strong_alias (__flockfile, _IO_flockfile)