[RFC] fix gdb.threads/non-stop-fair-events.exp timeouts

Message ID 55EF0D11.2020200@redhat.com
State New, archived
Headers

Commit Message

Pedro Alves Sept. 8, 2015, 4:30 p.m. UTC
  On 09/04/2015 05:54 PM, Sandra Loosemore wrote:
> While running GDB tests on nios2-linux-gnu with gdbserver and "target 
> remote", I've been seeing random failures in 
> gdb.threads/non-stop-fair-events.exp.  E.g. in one test run I got
> 
> FAIL: gdb.threads/non-stop-fair-events.exp: signal_thread=6: thread 1 
> broke out of loop (timeout)
> FAIL: gdb.threads/non-stop-fair-events.exp: signal_thread=6: thread 2 
> broke out of loop (timeout)
> FAIL: gdb.threads/non-stop-fair-events.exp: signal_thread=6: thread 3 
> broke out of loop (timeout)
> FAIL: gdb.threads/non-stop-fair-events.exp: signal_thread=7: thread 1 
> broke out of loop (timeout)
> FAIL: gdb.threads/non-stop-fair-events.exp: signal_thread=10: thread 1 
> broke out of loop (timeout)
> FAIL: gdb.threads/non-stop-fair-events.exp: signal_thread=10: thread 2 
> broke out of loop (timeout)
> 
> and in other test runs I got a different ones.  The pattern seemed to be 
> that sometimes it took an extra long time for the first thread to break 
> out of the loop, but once that happened they would all stop correctly 
> and send the expected replies even though GDB had given up on waiting 
> for the first few already.

Yeah, I've seen this before with a local series I use for debugging
software single-step things that implements software single-stepping
on x86.  I just re-tried it now after rebasing that series to
current mainline, and I still see the time outs against gdbserver.

AFAICS, nios2 is a software single-step target that does not implement
displaced stepping either.  I had a patch for this that I had
never posted.  See attached.

> 
> I've come up with the attached patch to factor the timeout for the 
> failing tests by the number of threads still running, which seems to 
> take care of the problem.  Does this seem reasonable?

I'd rather avoid it unconditionally; it's just 10 threads, and if the
target supports displaced stepping, if starvation avoidance in gdb is
working correctly, the test should complete quickly.  I takes a couple
seconds on my getting-old x86-64 laptop.

> 
> I'm somewhat confused because, in spite of it sometimes taking at least 
> 3 times the normal timeout for the first stop message to appear, the 
> alarm in the test case (which is tied to the normal timeout) was never 
> triggering.  My best theory on that is that the slowness is not in the 
> test case, but rather in gdbserver.  IOW, all the threads are already 
> stopped by the time the alarm would expire, but gdb and gdbserver 
> haven't finished all the notifications and requests to print a stop 
> message for any of the threads yet.  Is that plausible?  Should the 
> timeout for the alarm be factored by the number of threads, too, just to 
> be safe?

Or maybe it was, and the SIGALRM never manages to be processed by gdb
and passed down to the inferior.

> 
> I'm also not entirely sure what this test case is supposed to test. 
>  From the original commit message and comments in the .exp file it seems 
> like timeouts were supposed to be a sign of a broken kernel with thread 
> starvation problems, not bugs in gdb or gdbserver.  

On the kernel side, "waitpid(-1, ...)" just walks the task list linearly
looking for the first that had an event.  Say you have two threads, A and
B which are constantly hitting events/breakpoints.  If A is quick enough,
"waitpid(-1, ...)" returns the event for thread A over and over, and thread B is
starved.  The linux backends in both gdb and gdbserver have code in place that
picks an event LWP at random out of all that have had events.  A similar problem
exists as soon as events are queued out of the target backends into gdb's core
run control -- events can end up pending for processing later in gdb's core
data structures too, and so if gdb just picked those events by walking its own
thread list looking for the first thread that has an event pending, it'd starve
some threads.  So again infrun.c has similar randomization code to avoid
starvation (random_pending_event_thread).

> But, don't we 
> normally just skip tests that the target doesn't support or can't run 
> properly, rather than report them as FAILs?  And, I don't know how to 
> distinguish timeouts that mean the kernel is broken from timeouts that 
> mean the target is just slow and you need to set a bigger value in the 
> test harness.

Pedro Alves
  

Comments

Sandra Loosemore Sept. 9, 2015, 4:08 p.m. UTC | #1
On 09/08/2015 10:30 AM, Pedro Alves wrote:
>
> Yeah, I've seen this before with a local series I use for debugging
> software single-step things that implements software single-stepping
> on x86.  I just re-tried it now after rebasing that series to
> current mainline, and I still see the time outs against gdbserver.
>
> AFAICS, nios2 is a software single-step target that does not implement
> displaced stepping either.  I had a patch for this that I had
> never posted.  See attached.
>

Hmmm, these two patches are not working for me.  The trouble is that 
this part:

> +gdb_test_multiple "si" $msg {
> +    -re "displaced pc to.*$gdb_prompt $" {
> +	set displaced_stepping_enabled 1
> +    }
> +    -re ".*$gdb_prompt $" {
> +    }
> +}

is causing the target to step from main to pthread_self, which is in a 
different file.  This causes the subsequent breakpoint commands to fail, 
and things go south from there:

Breakpoint 1, main () at 
/scratch/sandra/nios2-linux-trunk/src/gdb-trunk/gdb/testsuite/gdb.threads/non-stop-fair-events.c:76
76	  pthread_kill (pthread_self (), 0);
(gdb) handle SIGUSR1 print nostop pass
Signal        Stop	Print	Pass to program	Description
SIGUSR1       No	Yes	Yes		User defined signal 1
(gdb) PASS: gdb.threads/non-stop-fair-events.exp: handle SIGUSR1 print 
nostop pass
print num_threads
$1 = 10
(gdb) PASS: gdb.threads/non-stop-fair-events.exp: get num_threads
set debug displaced 1
(gdb) PASS: gdb.threads/non-stop-fair-events.exp: set debug displaced 1
si
0x00002720 in pthread_self () at pthread_self.c:27
27	}
(gdb) set debug displaced 0
(gdb) PASS: gdb.threads/non-stop-fair-events.exp: set debug displaced 0
delete breakpoints
Delete all breakpoints? (y or n) y
(gdb) info breakpoints
No breakpoints or watchpoints.
(gdb) print got_sig = 0
$2 = 0
(gdb) PASS: gdb.threads/non-stop-fair-events.exp: signal_thread=2: print 
got_sig = 0
break 63
No line 63 in the current file.
Make breakpoint pending on future shared library load? (y or [n]) n
(gdb) FAIL: gdb.threads/non-stop-fair-events.exp: signal_thread=2: 
setting breakpoint at 63
break 88
No line 88 in the current file.
Make breakpoint pending on future shared library load? (y or [n]) n
(gdb) FAIL: gdb.threads/non-stop-fair-events.exp: signal_thread=2: 
setting breakpoint at 88

-Sandra
  
Pedro Alves Sept. 9, 2015, 6:40 p.m. UTC | #2
On 09/09/2015 05:08 PM, Sandra Loosemore wrote:
> On 09/08/2015 10:30 AM, Pedro Alves wrote:
>>
>> Yeah, I've seen this before with a local series I use for debugging
>> software single-step things that implements software single-stepping
>> on x86.  I just re-tried it now after rebasing that series to
>> current mainline, and I still see the time outs against gdbserver.
>>
>> AFAICS, nios2 is a software single-step target that does not implement
>> displaced stepping either.  I had a patch for this that I had
>> never posted.  See attached.
>>
> 
> Hmmm, these two patches are not working for me.  The trouble is that 
> this part:
> 
>> +gdb_test_multiple "si" $msg {
>> +    -re "displaced pc to.*$gdb_prompt $" {
>> +	set displaced_stepping_enabled 1
>> +    }
>> +    -re ".*$gdb_prompt $" {
>> +    }
>> +}
> 
> is causing the target to step from main to pthread_self, which is in a 
> different file.  This causes the subsequent breakpoint commands to fail, 
> and things go south from there:

OK, I got "lucky" on x86 and a stepi runs some instruction before the call.
Could you try simply replacing the "si" with "next" ?  It doesn't matter
whether that issues several single-steps or not.  What matters is that gdb
tries to step past the breakpoint that is set at the current PC (from the
earlier runto_main).  We're trying to figure out if gdb uses displaced
stepping for that.

Thanks,
Pedro Alves
  
Sandra Loosemore Sept. 15, 2015, 2:24 a.m. UTC | #3
On 09/09/2015 12:40 PM, Pedro Alves wrote:
> On 09/09/2015 05:08 PM, Sandra Loosemore wrote:
>> On 09/08/2015 10:30 AM, Pedro Alves wrote:
>>>
>>> Yeah, I've seen this before with a local series I use for debugging
>>> software single-step things that implements software single-stepping
>>> on x86.  I just re-tried it now after rebasing that series to
>>> current mainline, and I still see the time outs against gdbserver.
>>>
>>> AFAICS, nios2 is a software single-step target that does not implement
>>> displaced stepping either.  I had a patch for this that I had
>>> never posted.  See attached.
>>>
>>
>> Hmmm, these two patches are not working for me.  The trouble is that
>> this part:
>>
>>> +gdb_test_multiple "si" $msg {
>>> +    -re "displaced pc to.*$gdb_prompt $" {
>>> +	set displaced_stepping_enabled 1
>>> +    }
>>> +    -re ".*$gdb_prompt $" {
>>> +    }
>>> +}
>>
>> is causing the target to step from main to pthread_self, which is in a
>> different file.  This causes the subsequent breakpoint commands to fail,
>> and things go south from there:
>
> OK, I got "lucky" on x86 and a stepi runs some instruction before the call.
> Could you try simply replacing the "si" with "next" ?  It doesn't matter
> whether that issues several single-steps or not.  What matters is that gdb
> tries to step past the breakpoint that is set at the current PC (from the
> earlier runto_main).  We're trying to figure out if gdb uses displaced
> stepping for that.

Yes, that fixes the trouble, and the tests run OK now.  I did find that 
it still timed out 1 of the 5 times I ran it, though, so maybe the 
timeout factor really does need to match NUM_THREADS to be safe?

-Sandra
  

Patch

From ccd1e1cca1409ea8eb6f564561280f345f321077 Mon Sep 17 00:00:00 2001
From: Pedro Alves <palves@redhat.com>
Date: Tue, 8 Sep 2015 17:26:02 +0100
Subject: [PATCH 2/2] non-stop-fair-events.exp slower on software single-step
 && !displ-step targets

On software single-step targets that don't support displaced stepping,
threads keep hitting each other's single-step breakpoints, and then
GDB needs to pause all threads to step past those.  The end result is
that progress in the main thread will be slower and it may take a bit
longer for the signal to be queued.  This patch bumps the timeout on
such targets.

gdb/testsuite/ChangeLog:
2015-09-08  Pedro Alves  <palves@redhat.com>

	* gdb.threads/non-stop-fair-events.c (timeout): New global.
	(SECONDS): Redefine.
	(main): Call pthread_kill and alarm early.
	* gdb.threads/non-stop-fair-events.exp: Probe displaced stepping
	support.
	(test): If the target can't hardware step and doesn't support
	displaced stepping, increase the timeout.
---
 gdb/testsuite/gdb.threads/non-stop-fair-events.c   |  9 ++-
 gdb/testsuite/gdb.threads/non-stop-fair-events.exp | 91 +++++++++++++++-------
 gdb/testsuite/lib/gdb.exp                          | 20 +++--
 3 files changed, 82 insertions(+), 38 deletions(-)

diff --git a/gdb/testsuite/gdb.threads/non-stop-fair-events.c b/gdb/testsuite/gdb.threads/non-stop-fair-events.c
index f82c366..700676b 100644
--- a/gdb/testsuite/gdb.threads/non-stop-fair-events.c
+++ b/gdb/testsuite/gdb.threads/non-stop-fair-events.c
@@ -24,7 +24,9 @@ 
 const int num_threads = NUM_THREADS;
 /* Allow for as much timeout as DejaGnu wants, plus a bit of
    slack.  */
-#define SECONDS (TIMEOUT + 20)
+
+volatile unsigned int timeout = TIMEOUT;
+#define SECONDS (timeout + 20)
 
 pthread_t child_thread[NUM_THREADS];
 volatile pthread_t signal_thread;
@@ -69,6 +71,11 @@  main (void)
   int res;
   int i;
 
+  /* Call these early so that we're sure their PLTs are quickly
+     resolved now, instead of in the busy threads.  */
+  pthread_kill (pthread_self (), 0);
+  alarm (0);
+
   signal (SIGUSR1, handler);
 
   for (i = 0; i < NUM_THREADS; i++)
diff --git a/gdb/testsuite/gdb.threads/non-stop-fair-events.exp b/gdb/testsuite/gdb.threads/non-stop-fair-events.exp
index 37f5bcb..ba14f6a 100644
--- a/gdb/testsuite/gdb.threads/non-stop-fair-events.exp
+++ b/gdb/testsuite/gdb.threads/non-stop-fair-events.exp
@@ -62,6 +62,19 @@  set NUM_THREADS [get_value "num_threads" "get num_threads"]
 # Account for the main thread.
 incr NUM_THREADS
 
+# Probe for displaced stepping support.
+set displaced_stepping_enabled 0
+set msg "check displaced-stepping"
+gdb_test_no_output "set debug displaced 1"
+gdb_test_multiple "si" $msg {
+    -re "displaced pc to.*$gdb_prompt $" {
+	set displaced_stepping_enabled 1
+    }
+    -re ".*$gdb_prompt $" {
+    }
+}
+gdb_test_no_output "set debug displaced 0"
+
 # Run threads to their start positions.  This prepares for a new test
 # sequence.
 
@@ -127,6 +140,8 @@  proc enable_debug {enable} {
 proc test {signal_thread} {
     global gdb_prompt
     global NUM_THREADS
+    global timeout
+    global displaced_stepping_enabled
 
     with_test_prefix "signal_thread=$signal_thread" {
 	restart
@@ -152,42 +167,58 @@  proc test {signal_thread} {
 
 	enable_debug 1
 
-	set saw_continuing 0
-	set test "continue &"
-	gdb_test_multiple $test $test {
-	    -re "Continuing.\r\n" {
-		set saw_continuing 1
-		exp_continue
-	    }
-	    -re "$gdb_prompt " {
-		gdb_assert $saw_continuing $test
-	    }
-	    -re "infrun:" {
-		exp_continue
-	    }
+	# On software single-step targets that don't support displaced
+	# stepping, threads keep hitting each others' single-step
+	# breakpoints, and then GDB needs to pause all threads to step
+	# past those.  The end result is that progress in the main
+	# thread will be slower and it may take a bit longer for the
+	# signal to be queued; bump the timeout.
+	if {!$displaced_stepping_enabled && ![can_hardware_single_step]} {
+	    set factor 5
+	} else {
+	    set factor 1
 	}
-
-	set gotit 0
-
-	# Wait for all threads to finish their steps, and for the main
-	# thread to hit the breakpoint.
-	for {set i 1} { $i <= $NUM_THREADS } { incr i } {
-	    set test "thread $i broke out of loop"
-	    set gotit 0
-	    gdb_test_multiple "" $test {
-		-re "loop_broke" {
-		    # The prompt was already matched in the "continue
-		    # &" test above.  We're now consuming asynchronous
-		    # output that comes after the prompt.
-		    set gotit 1
-		    pass $test
+	with_timeout_factor $factor {
+	    gdb_test "print timeout = $timeout" " = $timeout" \
+		"set timeout in the inferior"
+
+	    set saw_continuing 0
+	    set test "continue &"
+	    gdb_test_multiple $test $test {
+		-re "Continuing.\r\n" {
+		    set saw_continuing 1
+		    exp_continue
+		}
+		-re "$gdb_prompt " {
+		    gdb_assert $saw_continuing $test
 		}
 		-re "infrun:" {
 		    exp_continue
 		}
 	    }
-	    if {!$gotit} {
-		break
+
+	    set gotit 0
+
+	    # Wait for all threads to finish their steps, and for the main
+	    # thread to hit the breakpoint.
+	    for {set i 1} { $i <= $NUM_THREADS } { incr i } {
+		set test "thread $i broke out of loop"
+		set gotit 0
+		gdb_test_multiple "" $test {
+		    -re "loop_broke" {
+			# The prompt was already matched in the "continue
+			# &" test above.  We're now consuming asynchronous
+			# output that comes after the prompt.
+			set gotit 1
+			pass $test
+		    }
+		    -re "infrun:" {
+			exp_continue
+		    }
+		}
+		if {!$gotit} {
+		    break
+		}
 	    }
 	}
 
diff --git a/gdb/testsuite/lib/gdb.exp b/gdb/testsuite/lib/gdb.exp
index 56cde7a..9eaf721 100644
--- a/gdb/testsuite/lib/gdb.exp
+++ b/gdb/testsuite/lib/gdb.exp
@@ -2150,15 +2150,10 @@  proc supports_get_siginfo_type {} {
     }
 }
 
-# Return 1 if target hardware or OS supports single stepping to signal
-# handler, otherwise, return 0.
+# Return 1 if the target supports hardware single stepping.
 
-proc can_single_step_to_signal_handler {} {
+proc can_hardware_single_step {} {
 
-    # Targets don't have hardware single step.  On these targets, when
-    # a signal is delivered during software single step, gdb is unable
-    # to determine the next instruction addresses, because start of signal
-    # handler is one of them.
     if { [istarget "arm*-*-*"] || [istarget "mips*-*-*"]
 	 || [istarget "tic6x-*-*"] || [istarget "sparc*-*-linux*"]
 	 || [istarget "nios2-*-*"] } {
@@ -2168,6 +2163,17 @@  proc can_single_step_to_signal_handler {} {
     return 1
 }
 
+# Return 1 if target hardware or OS supports single stepping to signal
+# handler, otherwise, return 0.
+
+proc can_single_step_to_signal_handler {} {
+    # Targets don't have hardware single step.  On these targets, when
+    # a signal is delivered during software single step, gdb is unable
+    # to determine the next instruction addresses, because start of signal
+    # handler is one of them.
+    return [can_hardware_single_step]
+}
+
 # Return 1 if target supports process record, otherwise return 0.
 
 proc supports_process_record {} {
-- 
1.9.3