gdb: Don't drop SIGSTOP during stop_all_threads

Message ID 20180518105654.6932-1-andrew.burgess@embecosm.com
State New, archived
Headers

Commit Message

Andrew Burgess May 18, 2018, 10:56 a.m. UTC
  This patch fixes an issues where GDB would sometimes hang when attaching
to a multi-threaded process.  This issues was especially likely to
trigger if the machine (running the inferior) was under load.

In summary, the problem is an imbalance between two functions in
linux-nat.c, stop_callback and stop_wait_callback.  In stop_callback we
send SIGSTOP to a thread, but _only_ if the thread is not already
stopped, and if it is not signalled, which means it should stop soon.
In stop_wait_callback we wait for the SIGSTOP to arrive, however, we are
aware that the thread might have been signalled for some other reason,
and so if a signal other than SIGSTOP causes the thread to stop then we
stash that signal away so it can be reported back later.  If we get a
SIGSTOP then this is discarded, after all, this signal was sent from
stop_callback.  Except that this might not be the case, it could be that
SIGSTOP was sent to a thread from elsewhere in GDB, in which case we
would not have sent another SIGSTOP from stop_callback and the SIGSTOP
received in stop_wait_callback should not be ignored.

Below I've laid out the exact sequence of events that I saw that lead me
to track down the above diagnosis.

After attaching to the inferior GDB sends a SIGSTOP to all of the
threads and then returns to the event loop waiting for interesting
things to happen.

Eventually the first target event is detected (this will be the first
SIGSTOP arriving) and GDB calls inferior_event_handler which calls
fetch_inferior_event.  Inside fetch_inferior_event GDB calls
do_target_wait which calls target_wait to find a thread with an event.

The target_wait call ends up in linux_nat_wait_1, which first checks to
see if any threads already have stashed stop events to report, and if
there are non then we enter a loop fetching as many events as possible
out of the kernel.  This event fetching is non-blocking, and we give up
once the kernel has no more events ready to give us.

All of the events from the kernel are passed through
linux_nat_filter_event which stashes the wait status for all of the
threads that reported a SIGSTOP, these will be returned by future calls
to linux_nat_wait_1.

Lets assume for a moment that we've attached to a multi-threaded
inferior, and that all but one thread has reported its stop during the
initial wait call in linux_nat_wait_1.  The other thread will be
reporting a SIGSTOP, but the kernel has not yet managed to deliver that
signal to GDB before GDB gave up waiting and continued handling the
events it already had.  GDB selects one of the threads that has reported
a SIGSTOP and passes this thread ID back to fetch_inferior_event.

To handle the thread's SIGSTOP, GDB calls handle_signal_stop, which
calls stop_all_threads, this calls wait_one, which in turn calls
target_wait.

The first call to target_wait at this point will result in a stashed
wait status being returned, at which point we call setup_inferior.  The
call to setup_inferior leads to a call into try_thread_db_load_1 which
results in a call to linux_stop_and_wait_all_lwps.  This in turn calls
stop_callback on each thread followed by stop_wait_callback on each
thread.

We're now ready to make the mistake.  In stop_callback we see that our
problem thread is not stopped, but is signalled, so it should stop soon.
As a result we don't send another SIGSTOP.

We then enter stop_wait_callback, eventually the problem thread stops
with SIGSTOP which we _incorrectly_ assume came from stop_callback, and
we discard.

Once stop_wait_callback has done its damage we return from
linux_stop_and_wait_all_lwps, finish in try_thread_db_load_1, and
eventually unwind back to the call to setup_inferior in
stop_all_threads.  GDB now loops around, and performs another
target_wait to get the next event from the inferior.

The target_wait calls causes us to once again reach linux_nat_wait_1,
and we pass through some code that calls resume_stopped_resumed_lwps.
This allows GDB to resume threads that are physically stopped, but which
GDB doesn't see any good reason for the thread to remain stopped.  In
our case, the problem thread which had its SIGSTOP discarded is stopped,
but doesn't have a stashed wait status to report, and so GDB sets the
thread going again.

We are now stuck waiting for an event on the problem thread that might
never arrive.

When considering how to write a test for this bug I struggled.  The
issue was only spotted _randomly_ when a machine was heavily loaded with
many multi-threaded applications, and GDB was being attached (by script)
to all of these applications in parallel.  In one reproducer I required
around 5 applications each of 5 threads per machine core in order to
reproduce the bug 2 out of 3 times.

What we really want to do though is simulate the kernel being slow to
report events through waitpid during the initial attach.  The solution I
came up with was to write an LD_PRELOAD library that intercepts (some)
waitpid calls and rate limits them to one per-second.  Any more than
that simply return 0 indicating there's no event available.  Obviously
this can only be applied to waitpid calls that have the WNOHANG flag
set.

Unfortunately, this library does break the "rest" of GDB, I suspect that
the issue is that in some places GDB knows that there's an event
pending, calls non-blocking waitpid and when the event fails to arrive
GDB moves into some unexpected/broken state from which it can't recover.
Still, the preload library does, at the moment, trigger the bug during
attach.

gdb/ChangeLog:

	* linux-nat.c (stop_wait_callback): Don't discard SIGSTOP if it
	was requested by GDB.

gdb/testsuite/ChangeLog:

	* gdb.threads/attach-slow-waitpid.c: New file.
	* gdb.threads/attach-slow-waitpid.exp: New file.
	* gdb.threads/slow-waitpid.c: New file.
---
 gdb/ChangeLog                                     |   6 ++
 gdb/linux-nat.c                                   |  14 ++-
 gdb/testsuite/ChangeLog                           |   7 ++
 gdb/testsuite/gdb.threads/attach-slow-waitpid.c   |  77 ++++++++++++++
 gdb/testsuite/gdb.threads/attach-slow-waitpid.exp |  95 ++++++++++++++++++
 gdb/testsuite/gdb.threads/slow-waitpid.c          | 117 ++++++++++++++++++++++
 6 files changed, 312 insertions(+), 4 deletions(-)
 create mode 100644 gdb/testsuite/gdb.threads/attach-slow-waitpid.c
 create mode 100644 gdb/testsuite/gdb.threads/attach-slow-waitpid.exp
 create mode 100644 gdb/testsuite/gdb.threads/slow-waitpid.c
  

Comments

Andrew Burgess June 11, 2018, 12:55 p.m. UTC | #1
Ping.

Thanks,
Andrewx

* Andrew Burgess <andrew.burgess@embecosm.com> [2018-05-18 11:56:54 +0100]:

> This patch fixes an issues where GDB would sometimes hang when attaching
> to a multi-threaded process.  This issues was especially likely to
> trigger if the machine (running the inferior) was under load.
> 
> In summary, the problem is an imbalance between two functions in
> linux-nat.c, stop_callback and stop_wait_callback.  In stop_callback we
> send SIGSTOP to a thread, but _only_ if the thread is not already
> stopped, and if it is not signalled, which means it should stop soon.
> In stop_wait_callback we wait for the SIGSTOP to arrive, however, we are
> aware that the thread might have been signalled for some other reason,
> and so if a signal other than SIGSTOP causes the thread to stop then we
> stash that signal away so it can be reported back later.  If we get a
> SIGSTOP then this is discarded, after all, this signal was sent from
> stop_callback.  Except that this might not be the case, it could be that
> SIGSTOP was sent to a thread from elsewhere in GDB, in which case we
> would not have sent another SIGSTOP from stop_callback and the SIGSTOP
> received in stop_wait_callback should not be ignored.
> 
> Below I've laid out the exact sequence of events that I saw that lead me
> to track down the above diagnosis.
> 
> After attaching to the inferior GDB sends a SIGSTOP to all of the
> threads and then returns to the event loop waiting for interesting
> things to happen.
> 
> Eventually the first target event is detected (this will be the first
> SIGSTOP arriving) and GDB calls inferior_event_handler which calls
> fetch_inferior_event.  Inside fetch_inferior_event GDB calls
> do_target_wait which calls target_wait to find a thread with an event.
> 
> The target_wait call ends up in linux_nat_wait_1, which first checks to
> see if any threads already have stashed stop events to report, and if
> there are non then we enter a loop fetching as many events as possible
> out of the kernel.  This event fetching is non-blocking, and we give up
> once the kernel has no more events ready to give us.
> 
> All of the events from the kernel are passed through
> linux_nat_filter_event which stashes the wait status for all of the
> threads that reported a SIGSTOP, these will be returned by future calls
> to linux_nat_wait_1.
> 
> Lets assume for a moment that we've attached to a multi-threaded
> inferior, and that all but one thread has reported its stop during the
> initial wait call in linux_nat_wait_1.  The other thread will be
> reporting a SIGSTOP, but the kernel has not yet managed to deliver that
> signal to GDB before GDB gave up waiting and continued handling the
> events it already had.  GDB selects one of the threads that has reported
> a SIGSTOP and passes this thread ID back to fetch_inferior_event.
> 
> To handle the thread's SIGSTOP, GDB calls handle_signal_stop, which
> calls stop_all_threads, this calls wait_one, which in turn calls
> target_wait.
> 
> The first call to target_wait at this point will result in a stashed
> wait status being returned, at which point we call setup_inferior.  The
> call to setup_inferior leads to a call into try_thread_db_load_1 which
> results in a call to linux_stop_and_wait_all_lwps.  This in turn calls
> stop_callback on each thread followed by stop_wait_callback on each
> thread.
> 
> We're now ready to make the mistake.  In stop_callback we see that our
> problem thread is not stopped, but is signalled, so it should stop soon.
> As a result we don't send another SIGSTOP.
> 
> We then enter stop_wait_callback, eventually the problem thread stops
> with SIGSTOP which we _incorrectly_ assume came from stop_callback, and
> we discard.
> 
> Once stop_wait_callback has done its damage we return from
> linux_stop_and_wait_all_lwps, finish in try_thread_db_load_1, and
> eventually unwind back to the call to setup_inferior in
> stop_all_threads.  GDB now loops around, and performs another
> target_wait to get the next event from the inferior.
> 
> The target_wait calls causes us to once again reach linux_nat_wait_1,
> and we pass through some code that calls resume_stopped_resumed_lwps.
> This allows GDB to resume threads that are physically stopped, but which
> GDB doesn't see any good reason for the thread to remain stopped.  In
> our case, the problem thread which had its SIGSTOP discarded is stopped,
> but doesn't have a stashed wait status to report, and so GDB sets the
> thread going again.
> 
> We are now stuck waiting for an event on the problem thread that might
> never arrive.
> 
> When considering how to write a test for this bug I struggled.  The
> issue was only spotted _randomly_ when a machine was heavily loaded with
> many multi-threaded applications, and GDB was being attached (by script)
> to all of these applications in parallel.  In one reproducer I required
> around 5 applications each of 5 threads per machine core in order to
> reproduce the bug 2 out of 3 times.
> 
> What we really want to do though is simulate the kernel being slow to
> report events through waitpid during the initial attach.  The solution I
> came up with was to write an LD_PRELOAD library that intercepts (some)
> waitpid calls and rate limits them to one per-second.  Any more than
> that simply return 0 indicating there's no event available.  Obviously
> this can only be applied to waitpid calls that have the WNOHANG flag
> set.
> 
> Unfortunately, this library does break the "rest" of GDB, I suspect that
> the issue is that in some places GDB knows that there's an event
> pending, calls non-blocking waitpid and when the event fails to arrive
> GDB moves into some unexpected/broken state from which it can't recover.
> Still, the preload library does, at the moment, trigger the bug during
> attach.
> 
> gdb/ChangeLog:
> 
> 	* linux-nat.c (stop_wait_callback): Don't discard SIGSTOP if it
> 	was requested by GDB.
> 
> gdb/testsuite/ChangeLog:
> 
> 	* gdb.threads/attach-slow-waitpid.c: New file.
> 	* gdb.threads/attach-slow-waitpid.exp: New file.
> 	* gdb.threads/slow-waitpid.c: New file.
> ---
>  gdb/ChangeLog                                     |   6 ++
>  gdb/linux-nat.c                                   |  14 ++-
>  gdb/testsuite/ChangeLog                           |   7 ++
>  gdb/testsuite/gdb.threads/attach-slow-waitpid.c   |  77 ++++++++++++++
>  gdb/testsuite/gdb.threads/attach-slow-waitpid.exp |  95 ++++++++++++++++++
>  gdb/testsuite/gdb.threads/slow-waitpid.c          | 117 ++++++++++++++++++++++
>  6 files changed, 312 insertions(+), 4 deletions(-)
>  create mode 100644 gdb/testsuite/gdb.threads/attach-slow-waitpid.c
>  create mode 100644 gdb/testsuite/gdb.threads/attach-slow-waitpid.exp
>  create mode 100644 gdb/testsuite/gdb.threads/slow-waitpid.c
> 
> diff --git a/gdb/linux-nat.c b/gdb/linux-nat.c
> index 445b59fa4ad..d1d0ba66d2e 100644
> --- a/gdb/linux-nat.c
> +++ b/gdb/linux-nat.c
> @@ -2527,17 +2527,23 @@ stop_wait_callback (struct lwp_info *lp, void *data)
>  	}
>        else
>  	{
> -	  /* We caught the SIGSTOP that we intended to catch, so
> -	     there's no SIGSTOP pending.  */
> +	  /* We caught the SIGSTOP that we intended to catch.  */
>  
>  	  if (debug_linux_nat)
>  	    fprintf_unfiltered (gdb_stdlog,
>  				"SWC: Expected SIGSTOP caught for %s.\n",
>  				target_pid_to_str (lp->ptid));
>  
> -	  /* Reset SIGNALLED only after the stop_wait_callback call
> -	     above as it does gdb_assert on SIGNALLED.  */
>  	  lp->signalled = 0;
> +
> +	  /* If we are waiting for this stop so we can report the thread
> +	     stopped then we need to record that this status.  Otherwise,
> +	     we can now discard this stop event.  */
> +	  if (lp->last_resume_kind == resume_stop)
> +	    {
> +	      lp->status = status;
> +	      save_stop_reason (lp);
> +	    }
>  	}
>      }
>  
> diff --git a/gdb/testsuite/gdb.threads/attach-slow-waitpid.c b/gdb/testsuite/gdb.threads/attach-slow-waitpid.c
> new file mode 100644
> index 00000000000..06e99ab22d1
> --- /dev/null
> +++ b/gdb/testsuite/gdb.threads/attach-slow-waitpid.c
> @@ -0,0 +1,77 @@
> +/* This testcase is part of GDB, the GNU debugger.
> +
> +   Copyright 2018 Free Software Foundation, Inc.
> +
> +   This program is free software; you can redistribute it and/or modify
> +   it under the terms of the GNU General Public License as published by
> +   the Free Software Foundation; either version 3 of the License, or
> +   (at your option) any later version.
> +
> +   This program is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +   GNU General Public License for more details.
> +
> +   You should have received a copy of the GNU General Public License
> +   along with this program.  If not, see <http://www.gnu.org/licenses/>.  */
> +
> +#include <pthread.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <assert.h>
> +#define NUM_THREADS 4
> +
> +/* Crude spin lock.  Threads all spin until this is set to 0.  */
> +int go = 1;
> +
> +/* Thread function, just spin until GO is set to 0.  */
> +void *
> +perform_work (void *argument)
> +{
> +  /* Cast to volatile to ensure that ARGUMENT is loaded each time around
> +     the loop.  */
> +  while (*((volatile int*) argument))
> +    {
> +      /* Nothing.  */
> +    }
> +  return NULL;
> +}
> +
> +/* The spin loop for the main thread.  */
> +void
> +function (void)
> +{
> +  (void) perform_work (&go);
> +  printf ("Finished from function\n");
> +}
> +
> +/* Main program, create some threads which all spin waiting for GO to be
> +   set to 0.  */
> +int
> +main (void)
> +{
> +  pthread_t threads[NUM_THREADS];
> +  int result_code;
> +  unsigned index;
> +
> +  /* Create some threads.  */
> +  for (index = 0; index < NUM_THREADS; ++index)
> +    {
> +      printf ("In main: creating thread %d\n", index);
> +      result_code = pthread_create (&threads[index], NULL, perform_work, &go);
> +      assert (!result_code);
> +    }
> +
> +  function ();
> +
> +  /* Wait for each thread to complete.  */
> +  for (index = 0; index < NUM_THREADS; ++index)
> +    {
> +      /* Block until thread INDEX completes.  */
> +      result_code = pthread_join (threads[index], NULL);
> +      assert (!result_code);
> +      printf ("In main: thread %d has completed\n", index);
> +    }
> +  printf ("In main: All threads completed successfully\n");
> +  return 0;
> +}
> diff --git a/gdb/testsuite/gdb.threads/attach-slow-waitpid.exp b/gdb/testsuite/gdb.threads/attach-slow-waitpid.exp
> new file mode 100644
> index 00000000000..d4b12727842
> --- /dev/null
> +++ b/gdb/testsuite/gdb.threads/attach-slow-waitpid.exp
> @@ -0,0 +1,95 @@
> +# Copyright 2018 Free Software Foundation, Inc.
> +#
> +# This program is free software; you can redistribute it and/or modify
> +# it under the terms of the GNU General Public License as published by
> +# the Free Software Foundation; either version 3 of the License, or
> +# (at your option) any later version.
> +#
> +# This program is distributed in the hope that it will be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +# GNU General Public License for more details.
> +#
> +# You should have received a copy of the GNU General Public License
> +# along with this program.  If not, see <http://www.gnu.org/licenses/>.
> +
> +# This test script tries to expose a bug in some of the use of waitpid
> +# in the linux native support within GDB.  The problem was spotted on
> +# systems which were heavily loaded when attaching to threaded test
> +# programs.  What happened was that during the initial attach, the
> +# loop of waitpid calls that normally received the stop events from
> +# each of the threads in the inferior was not receiving a stop event
> +# for some threads (the kernel just hadn't sent the stop event yet).
> +#
> +# GDB would then trigger a call to stop_all_threads which would
> +# continue to wait for all of the outstanding threads to stop, when
> +# the outstanding stop events finally arrived GDB would then
> +# (incorrectly) discard the stop event, resume the thread, and
> +# continue to wait for the thread to stop.... which it now never
> +# would.
> +#
> +# In order to try and expose this issue reliably, this test preloads a
> +# library that intercepts waitpid calls.  And waitpid calls targeting
> +# pid -1, and with the WNOHANG flag are rate limited so that only 1
> +# per second can complete, any additional calls are forced to return 0
> +# indicating no event waiting.  This is enough to trigger the bug
> +# during the attach phase, however, it breaks the rest of GDB's
> +# behaviour, so we can't do more than just attach with this library in
> +# place.
> +
> +# This test only works on Linux
> +if { ![isnative] || [is_remote host] || [use_gdb_stub]
> +     || ![istarget *-linux*] } {
> +    continue
> +}
> +
> +standard_testfile
> +
> +set libfile slow-waitpid
> +set libsrc "${srcdir}/${subdir}/${libfile}.c"
> +set libobj [standard_output_file ${libfile}.so]
> +
> +# Compile the preload library.  We only get away with this as we limit
> +# this test to running when ISNATIVE is true.
> +if { [gdb_compile_shlib $libsrc $libobj {debug}] != "" } then {
> +    return -1
> +}
> +
> +# Compile the test program
> +if  { [gdb_compile_pthreads "${srcdir}/${subdir}/${srcfile}" "${binfile}" executable {debug}] != "" } {
> +    return -1
> +}
> +
> +# Spawn GDB with LIB preloaded with LD_PRELOAD.  CMDLINE_OPTS are
> +# command line options passed to GDB.
> +
> +proc gdb_spawn_with_ld_preload {lib cmdline_opts} {
> +    global env
> +
> +    save_vars { env(LD_PRELOAD) } {
> +	if { ![info exists env(LD_PRELOAD) ]
> +	     || $env(LD_PRELOAD) == "" } {
> +	    set env(LD_PRELOAD) "$lib"
> +	} else {
> +	    append env(LD_PRELOAD) ":$lib"
> +	}
> +
> +	gdb_spawn_with_cmdline_opts $cmdline_opts
> +    }
> +}
> +
> +# Run test program in the background.
> +set test_spawn_id [spawn_wait_for_attach $binfile]
> +set testpid [spawn_id_get_pid $test_spawn_id]
> +
> +# Start GDB with preload library in place.
> +gdb_spawn_with_ld_preload $libobj ""
> +
> +# Load binary, and attach to running program.
> +gdb_load ${binfile}
> +gdb_test "attach $testpid" "Attaching to program.*" "attach to target"
> +
> +gdb_exit
> +
> +# Kill of test program.
> +kill_wait_spawned_process $test_spawn_id
> diff --git a/gdb/testsuite/gdb.threads/slow-waitpid.c b/gdb/testsuite/gdb.threads/slow-waitpid.c
> new file mode 100644
> index 00000000000..40b5c958899
> --- /dev/null
> +++ b/gdb/testsuite/gdb.threads/slow-waitpid.c
> @@ -0,0 +1,117 @@
> +/* This testcase is part of GDB, the GNU debugger.
> +
> +   Copyright 2018 Free Software Foundation, Inc.
> +
> +   This program is free software; you can redistribute it and/or modify
> +   it under the terms of the GNU General Public License as published by
> +   the Free Software Foundation; either version 3 of the License, or
> +   (at your option) any later version.
> +
> +   This program is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +   GNU General Public License for more details.
> +
> +   You should have received a copy of the GNU General Public License
> +   along with this program.  If not, see <http://www.gnu.org/licenses/>.  */
> +
> +#define _GNU_SOURCE
> +
> +#include <sys/types.h>
> +#include <sys/wait.h>
> +#include <sys/time.h>
> +#include <stdlib.h>
> +#include <stdio.h>
> +#include <dlfcn.h>
> +#include <string.h>
> +#include <stdarg.h>
> +
> +/* Logging.  */
> +
> +static void
> +log_msg (const char *fmt, ...)
> +{
> +#ifdef LOGGING
> +  va_list ap;
> +
> +  va_start (ap, fmt);
> +  vfprintf (stderr, fmt, ap);
> +  va_end (ap);
> +#endif /* LOGGING */
> +}
> +
> +/* Error handling, message and exit.  */
> +
> +static void
> +error (const char *fmt, ...)
> +{
> +  va_list ap;
> +
> +  va_start (ap, fmt);
> +  vfprintf (stderr, fmt, ap);
> +  va_end (ap);
> +
> +  exit (EXIT_FAILURE);
> +}
> +
> +/* Return true (non-zero) if we should skip this call to waitpid, or false
> +   (zero) if this waitpid call should be handled with a call to the "real"
> +   waitpid library.  Allows 1 waitpid call per second.  */
> +
> +static int
> +should_skip_waitpid (void)
> +{
> +  /* When we last allowed a waitpid to complete.  */
> +  static struct timeval last_waitpid_time = { 0, 0 };
> +
> +  struct timeval *tv = &last_waitpid_time;
> +  if (tv->tv_sec == 0)
> +    {
> +      if (gettimeofday (tv, NULL) < 0)
> +	error ("error: gettimeofday failed\n");
> +      return 0; /* Don't skip.  */
> +    }
> +  else
> +    {
> +      struct timeval new_tv;
> +
> +      if (gettimeofday (&new_tv, NULL) < 0)
> +	error ("error: gettimeofday failed\n");
> +
> +      if ((new_tv.tv_sec - tv->tv_sec) < 1)
> +	return 1; /* Skip.  */
> +
> +      memcpy (tv, &new_tv, sizeof (new_tv));
> +    }
> +
> +  /* Don't skip.  */
> +  return 0;
> +}
> +
> +/* The waitpid entry point function.  */
> +
> +pid_t
> +waitpid (pid_t pid, int *wstatus, int options)
> +{
> +  typedef pid_t (*fptr_t) (pid_t, int *, int);
> +  static fptr_t real_func = NULL;
> +
> +  if (real_func == NULL)
> +    {
> +      real_func = dlsym (RTLD_NEXT, "waitpid");
> +      if (real_func == NULL)
> +	error ("error: failed to find real waitpid\n");
> +    }
> +
> +  log_msg ("waitpid (%d, %p, 0x%x)\n", pid, wstatus, options);
> +
> +  if ((options & WNOHANG) != 0
> +      && pid == -1
> +      && should_skip_waitpid ())
> +    {
> +      log_msg ("waitpid: skipping\n");
> +      return 0;
> +    }
> +
> +  return (*real_func) (pid, wstatus, options);
> +}
> -- 
> 2.14.3
>
  
Pedro Alves June 11, 2018, 4:40 p.m. UTC | #2
On 05/18/2018 11:56 AM, Andrew Burgess wrote:
> This patch fixes an issues where GDB would sometimes hang when attaching
> to a multi-threaded process.  This issues was especially likely to
> trigger if the machine (running the inferior) was under load.

"issues" -> "issue" two times.

> The target_wait call ends up in linux_nat_wait_1, which first checks to
> see if any threads already have stashed stop events to report, and if
> there are non then we enter a loop fetching as many events as possible

"non" -> "none"

> out of the kernel.  This event fetching is non-blocking, and we give up
> once the kernel has no more events ready to give us.
> 
> All of the events from the kernel are passed through

> What we really want to do though is simulate the kernel being slow to
> report events through waitpid during the initial attach.  The solution I
> came up with was to write an LD_PRELOAD library that intercepts (some)
> waitpid calls and rate limits them to one per-second.  Any more than
> that simply return 0 indicating there's no event available.  Obviously
> this can only be applied to waitpid calls that have the WNOHANG flag
> set.
> 
> Unfortunately, this library does break the "rest" of GDB, I suspect that
> the issue is that in some places GDB knows that there's an event
> pending, calls non-blocking waitpid and when the event fails to arrive
> GDB moves into some unexpected/broken state from which it can't recover.
> Still, the preload library does, at the moment, trigger the bug during
> attach.

The fix sounds good to me.

Thanks so much for adding a testcase.  That adds a lot of value.

Some more comments below.

> 
> diff --git a/gdb/linux-nat.c b/gdb/linux-nat.c
> index 445b59fa4ad..d1d0ba66d2e 100644
> --- a/gdb/linux-nat.c
> +++ b/gdb/linux-nat.c
> @@ -2527,17 +2527,23 @@ stop_wait_callback (struct lwp_info *lp, void *data)
>  	}
>        else
>  	{
> -	  /* We caught the SIGSTOP that we intended to catch, so
> -	     there's no SIGSTOP pending.  */
> +	  /* We caught the SIGSTOP that we intended to catch.  */
>  
>  	  if (debug_linux_nat)
>  	    fprintf_unfiltered (gdb_stdlog,
>  				"SWC: Expected SIGSTOP caught for %s.\n",
>  				target_pid_to_str (lp->ptid));
>  
> -	  /* Reset SIGNALLED only after the stop_wait_callback call
> -	     above as it does gdb_assert on SIGNALLED.  */
>  	  lp->signalled = 0;
> +
> +	  /* If we are waiting for this stop so we can report the thread
> +	     stopped then we need to record that this status.  Otherwise,

Looks like on "that" too many in "that this"?

> +	     we can now discard this stop event.  */
> +	  if (lp->last_resume_kind == resume_stop)
> +	    {
> +	      lp->status = status;
> +	      save_stop_reason (lp);
> +	    }
>  	}
>      }


> +# This test script tries to expose a bug in some of the use of waitpid

"some of the uses" ?

> +# in the linux native support within GDB.  The problem was spotted on

"linux" -> "Linux".

> +# systems which were heavily loaded when attaching to threaded test
> +# programs.  What happened was that during the initial attach, the
> +# loop of waitpid calls that normally received the stop events from
> +# each of the threads in the inferior was not receiving a stop event
> +# for some threads (the kernel just hadn't sent the stop event yet).
> +#
> +# GDB would then trigger a call to stop_all_threads which would
> +# continue to wait for all of the outstanding threads to stop, when
> +# the outstanding stop events finally arrived GDB would then
> +# (incorrectly) discard the stop event, resume the thread, and
> +# continue to wait for the thread to stop.... which it now never
> +# would.
> +#
> +# In order to try and expose this issue reliably, this test preloads a
> +# library that intercepts waitpid calls.  And waitpid calls targeting
> +# pid -1, and with the WNOHANG flag are rate limited so that only 1
> +# per second can complete, any additional calls are forced to return 0
> +# indicating no event waiting.  

I find this last sentence starting with "And" hard to parse.  Maybe
you meant "All" instead of "And".  I'd also suggest splitting it in
two and dropping the comma after "-1".  Like:

All waitpid calls targeting pid -1 with the WNOHANG flag are rate
limited so that only 1 per second can complete.  Additional calls are
forced to return 0 indicating no event waiting.

> +
> +# Spawn GDB with LIB preloaded with LD_PRELOAD.  CMDLINE_OPTS are
> +# command line options passed to GDB.
> +
> +proc gdb_spawn_with_ld_preload {lib cmdline_opts} {
> +    global env
> +
> +    save_vars { env(LD_PRELOAD) } {
> +	if { ![info exists env(LD_PRELOAD) ]
> +	     || $env(LD_PRELOAD) == "" } {
> +	    set env(LD_PRELOAD) "$lib"
> +	} else {
> +	    append env(LD_PRELOAD) ":$lib"
> +	}
> +
> +	gdb_spawn_with_cmdline_opts $cmdline_opts
> +    }
> +}

The testcase currently fails with --target_board=native-extended-gdbserver:

 (gdb) attach 487
 Don't know how to attach.  Try "help target".
 (gdb) FAIL: gdb.threads/attach-slow-waitpid.exp: attach to target

It'd be really nice to make it work there too, considering that at
some point we may switch to using gdbserver's linux backend even
for native debugging.

The test fails there because that board overrides gdb_start to
automatically start gdbserver and have gdb connect to it, but not
gdb_spawn_with_cmdline_opts.  Can we simply use gdb_start instead?
It seems like no cmdline opts are passed down anyway.

In principle, the LD_PRELOAD will end up injecting the slow-waitpid
lib in gdbserver too.


> +static int
> +should_skip_waitpid (void)
> +{
> +  /* When we last allowed a waitpid to complete.  */
> +  static struct timeval last_waitpid_time = { 0, 0 };
> +
> +  struct timeval *tv = &last_waitpid_time;
> +  if (tv->tv_sec == 0)
> +    {
> +      if (gettimeofday (tv, NULL) < 0)
> +	error ("error: gettimeofday failed\n");
> +      return 0; /* Don't skip.  */
> +    }
> +  else
> +    {
> +      struct timeval new_tv;
> +
> +      if (gettimeofday (&new_tv, NULL) < 0)
> +	error ("error: gettimeofday failed\n");
> +
> +      if ((new_tv.tv_sec - tv->tv_sec) < 1)
> +	return 1; /* Skip.  */
> +
> +      memcpy (tv, &new_tv, sizeof (new_tv));

Write:

      *tv = new_tv;

?

Thanks,
Pedro Alves
  

Patch

diff --git a/gdb/linux-nat.c b/gdb/linux-nat.c
index 445b59fa4ad..d1d0ba66d2e 100644
--- a/gdb/linux-nat.c
+++ b/gdb/linux-nat.c
@@ -2527,17 +2527,23 @@  stop_wait_callback (struct lwp_info *lp, void *data)
 	}
       else
 	{
-	  /* We caught the SIGSTOP that we intended to catch, so
-	     there's no SIGSTOP pending.  */
+	  /* We caught the SIGSTOP that we intended to catch.  */
 
 	  if (debug_linux_nat)
 	    fprintf_unfiltered (gdb_stdlog,
 				"SWC: Expected SIGSTOP caught for %s.\n",
 				target_pid_to_str (lp->ptid));
 
-	  /* Reset SIGNALLED only after the stop_wait_callback call
-	     above as it does gdb_assert on SIGNALLED.  */
 	  lp->signalled = 0;
+
+	  /* If we are waiting for this stop so we can report the thread
+	     stopped then we need to record that this status.  Otherwise,
+	     we can now discard this stop event.  */
+	  if (lp->last_resume_kind == resume_stop)
+	    {
+	      lp->status = status;
+	      save_stop_reason (lp);
+	    }
 	}
     }
 
diff --git a/gdb/testsuite/gdb.threads/attach-slow-waitpid.c b/gdb/testsuite/gdb.threads/attach-slow-waitpid.c
new file mode 100644
index 00000000000..06e99ab22d1
--- /dev/null
+++ b/gdb/testsuite/gdb.threads/attach-slow-waitpid.c
@@ -0,0 +1,77 @@ 
+/* This testcase is part of GDB, the GNU debugger.
+
+   Copyright 2018 Free Software Foundation, Inc.
+
+   This program is free software; you can redistribute it and/or modify
+   it under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 3 of the License, or
+   (at your option) any later version.
+
+   This program is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+   GNU General Public License for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with this program.  If not, see <http://www.gnu.org/licenses/>.  */
+
+#include <pthread.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <assert.h>
+#define NUM_THREADS 4
+
+/* Crude spin lock.  Threads all spin until this is set to 0.  */
+int go = 1;
+
+/* Thread function, just spin until GO is set to 0.  */
+void *
+perform_work (void *argument)
+{
+  /* Cast to volatile to ensure that ARGUMENT is loaded each time around
+     the loop.  */
+  while (*((volatile int*) argument))
+    {
+      /* Nothing.  */
+    }
+  return NULL;
+}
+
+/* The spin loop for the main thread.  */
+void
+function (void)
+{
+  (void) perform_work (&go);
+  printf ("Finished from function\n");
+}
+
+/* Main program, create some threads which all spin waiting for GO to be
+   set to 0.  */
+int
+main (void)
+{
+  pthread_t threads[NUM_THREADS];
+  int result_code;
+  unsigned index;
+
+  /* Create some threads.  */
+  for (index = 0; index < NUM_THREADS; ++index)
+    {
+      printf ("In main: creating thread %d\n", index);
+      result_code = pthread_create (&threads[index], NULL, perform_work, &go);
+      assert (!result_code);
+    }
+
+  function ();
+
+  /* Wait for each thread to complete.  */
+  for (index = 0; index < NUM_THREADS; ++index)
+    {
+      /* Block until thread INDEX completes.  */
+      result_code = pthread_join (threads[index], NULL);
+      assert (!result_code);
+      printf ("In main: thread %d has completed\n", index);
+    }
+  printf ("In main: All threads completed successfully\n");
+  return 0;
+}
diff --git a/gdb/testsuite/gdb.threads/attach-slow-waitpid.exp b/gdb/testsuite/gdb.threads/attach-slow-waitpid.exp
new file mode 100644
index 00000000000..d4b12727842
--- /dev/null
+++ b/gdb/testsuite/gdb.threads/attach-slow-waitpid.exp
@@ -0,0 +1,95 @@ 
+# Copyright 2018 Free Software Foundation, Inc.
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 3 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+
+# This test script tries to expose a bug in some of the use of waitpid
+# in the linux native support within GDB.  The problem was spotted on
+# systems which were heavily loaded when attaching to threaded test
+# programs.  What happened was that during the initial attach, the
+# loop of waitpid calls that normally received the stop events from
+# each of the threads in the inferior was not receiving a stop event
+# for some threads (the kernel just hadn't sent the stop event yet).
+#
+# GDB would then trigger a call to stop_all_threads which would
+# continue to wait for all of the outstanding threads to stop, when
+# the outstanding stop events finally arrived GDB would then
+# (incorrectly) discard the stop event, resume the thread, and
+# continue to wait for the thread to stop.... which it now never
+# would.
+#
+# In order to try and expose this issue reliably, this test preloads a
+# library that intercepts waitpid calls.  And waitpid calls targeting
+# pid -1, and with the WNOHANG flag are rate limited so that only 1
+# per second can complete, any additional calls are forced to return 0
+# indicating no event waiting.  This is enough to trigger the bug
+# during the attach phase, however, it breaks the rest of GDB's
+# behaviour, so we can't do more than just attach with this library in
+# place.
+
+# This test only works on Linux
+if { ![isnative] || [is_remote host] || [use_gdb_stub]
+     || ![istarget *-linux*] } {
+    continue
+}
+
+standard_testfile
+
+set libfile slow-waitpid
+set libsrc "${srcdir}/${subdir}/${libfile}.c"
+set libobj [standard_output_file ${libfile}.so]
+
+# Compile the preload library.  We only get away with this as we limit
+# this test to running when ISNATIVE is true.
+if { [gdb_compile_shlib $libsrc $libobj {debug}] != "" } then {
+    return -1
+}
+
+# Compile the test program
+if  { [gdb_compile_pthreads "${srcdir}/${subdir}/${srcfile}" "${binfile}" executable {debug}] != "" } {
+    return -1
+}
+
+# Spawn GDB with LIB preloaded with LD_PRELOAD.  CMDLINE_OPTS are
+# command line options passed to GDB.
+
+proc gdb_spawn_with_ld_preload {lib cmdline_opts} {
+    global env
+
+    save_vars { env(LD_PRELOAD) } {
+	if { ![info exists env(LD_PRELOAD) ]
+	     || $env(LD_PRELOAD) == "" } {
+	    set env(LD_PRELOAD) "$lib"
+	} else {
+	    append env(LD_PRELOAD) ":$lib"
+	}
+
+	gdb_spawn_with_cmdline_opts $cmdline_opts
+    }
+}
+
+# Run test program in the background.
+set test_spawn_id [spawn_wait_for_attach $binfile]
+set testpid [spawn_id_get_pid $test_spawn_id]
+
+# Start GDB with preload library in place.
+gdb_spawn_with_ld_preload $libobj ""
+
+# Load binary, and attach to running program.
+gdb_load ${binfile}
+gdb_test "attach $testpid" "Attaching to program.*" "attach to target"
+
+gdb_exit
+
+# Kill of test program.
+kill_wait_spawned_process $test_spawn_id
diff --git a/gdb/testsuite/gdb.threads/slow-waitpid.c b/gdb/testsuite/gdb.threads/slow-waitpid.c
new file mode 100644
index 00000000000..40b5c958899
--- /dev/null
+++ b/gdb/testsuite/gdb.threads/slow-waitpid.c
@@ -0,0 +1,117 @@ 
+/* This testcase is part of GDB, the GNU debugger.
+
+   Copyright 2018 Free Software Foundation, Inc.
+
+   This program is free software; you can redistribute it and/or modify
+   it under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 3 of the License, or
+   (at your option) any later version.
+
+   This program is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+   GNU General Public License for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with this program.  If not, see <http://www.gnu.org/licenses/>.  */
+
+#define _GNU_SOURCE
+
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <sys/time.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <dlfcn.h>
+#include <string.h>
+#include <stdarg.h>
+
+/* Logging.  */
+
+static void
+log_msg (const char *fmt, ...)
+{
+#ifdef LOGGING
+  va_list ap;
+
+  va_start (ap, fmt);
+  vfprintf (stderr, fmt, ap);
+  va_end (ap);
+#endif /* LOGGING */
+}
+
+/* Error handling, message and exit.  */
+
+static void
+error (const char *fmt, ...)
+{
+  va_list ap;
+
+  va_start (ap, fmt);
+  vfprintf (stderr, fmt, ap);
+  va_end (ap);
+
+  exit (EXIT_FAILURE);
+}
+
+/* Return true (non-zero) if we should skip this call to waitpid, or false
+   (zero) if this waitpid call should be handled with a call to the "real"
+   waitpid library.  Allows 1 waitpid call per second.  */
+
+static int
+should_skip_waitpid (void)
+{
+  /* When we last allowed a waitpid to complete.  */
+  static struct timeval last_waitpid_time = { 0, 0 };
+
+  struct timeval *tv = &last_waitpid_time;
+  if (tv->tv_sec == 0)
+    {
+      if (gettimeofday (tv, NULL) < 0)
+	error ("error: gettimeofday failed\n");
+      return 0; /* Don't skip.  */
+    }
+  else
+    {
+      struct timeval new_tv;
+
+      if (gettimeofday (&new_tv, NULL) < 0)
+	error ("error: gettimeofday failed\n");
+
+      if ((new_tv.tv_sec - tv->tv_sec) < 1)
+	return 1; /* Skip.  */
+
+      memcpy (tv, &new_tv, sizeof (new_tv));
+    }
+
+  /* Don't skip.  */
+  return 0;
+}
+
+/* The waitpid entry point function.  */
+
+pid_t
+waitpid (pid_t pid, int *wstatus, int options)
+{
+  typedef pid_t (*fptr_t) (pid_t, int *, int);
+  static fptr_t real_func = NULL;
+
+  if (real_func == NULL)
+    {
+      real_func = dlsym (RTLD_NEXT, "waitpid");
+      if (real_func == NULL)
+	error ("error: failed to find real waitpid\n");
+    }
+
+  log_msg ("waitpid (%d, %p, 0x%x)\n", pid, wstatus, options);
+
+  if ((options & WNOHANG) != 0
+      && pid == -1
+      && should_skip_waitpid ())
+    {
+      log_msg ("waitpid: skipping\n");
+      return 0;
+    }
+
+  return (*real_func) (pid, wstatus, options);
+}