[PATCH/RFC] gdbserver: Add command line option to not use SO_REUSEADDR

Message ID 20190212184827.91720-1-alan.hayward@arm.com
State New, archived

Commit Message

Alan Hayward Feb. 12, 2019, 6:48 p.m. UTC
  (Not sure if there is any additional documentation that will need
 updating alongside this patch, or any other additions that need
 to go alongside new command line flags).

When running the testsuite using target_board=native-gdbserver
across many processors, there are random failures generated by timeouts
connecting to gdbserver.  Each timeout will cause the whole of an
.exp test script to fail (creating potentially dozens of FAILs).

Per full make check run:
On X86-64 Ubuntu 16.04, I see approximately 1 of these.
On AArch64 Ubuntu 16.04 / Centos 7.4, I see approximately 3 of these.

Forcing on gdb and gdbserver debug shows that for the failure cases
gdbserver is launched, connects to gdb and successfully sends and
receives packets.  Meanwhile on the gdb side, no packets are sent or
received and the connection simply times out.  This indicates that
the gdb from another running test has connected to this gdbserver.

On the gdbserver side, the socket is set to SO_REUSEADDR.  This
allows the socket to be quickly reused after gdbserver exits.  Without
it, the socket has to fully time out in the OS before it can be reused
again (typically between 30 and 120 seconds).
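
To make the mechanism concrete, here is a minimal sketch of a TCP
listener in C.  This is not the gdbserver implementation (which handles
multiple address families); the function name and error handling are
invented for illustration only.

#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/* Illustrative only: open a TCP listening socket on PORT.  With REUSE
   non-zero, bind () can succeed even while the port is still in TIME_WAIT
   from a previous process; without it, bind () fails with EADDRINUSE until
   the OS has timed the old socket out.  */

int
listen_on_port (int port, int reuse)
{
  int fd = socket (AF_INET, SOCK_STREAM, 0);
  if (fd < 0)
    return -1;

  if (reuse)
    {
      int tmp = 1;
      setsockopt (fd, SOL_SOCKET, SO_REUSEADDR, (char *) &tmp, sizeof (tmp));
    }

  struct sockaddr_in addr = { 0 };
  addr.sin_family = AF_INET;
  addr.sin_addr.s_addr = htonl (INADDR_ANY);
  addr.sin_port = htons (port);

  if (bind (fd, (struct sockaddr *) &addr, sizeof (addr)) != 0
      || listen (fd, 1) != 0)
    {
      close (fd);
      return -1;
    }
  return fd;
}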

Removing the SO_REUSEADDR code solves the issue.

One downside of this is an increase in the number of ports being used.
When running with -j55 on HEAD, I see roughly 100 different ports used
(i.e. ports 2346 to 2436).  With the patch applied, I see almost 300 ports
used (i.e. ports 2346 to 2627).

The other downside is that launching dozens of gdbservers at the same
time is not the usual user workflow.  A user is more likely to connect
to gdbserver, do some work, kill it, then launch it again - and there
they will expect to be able to reuse the same port.

This patch adds new command line flags to enable/disable SO_REUSEADDR.
By default, use of SO_REUSEADDR remains on, so users will not see any
change.  In the testsuite, gdbserver is always started with it disabled.
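
For example (the port number and program below are arbitrary, shown only
for illustration), disabling the option looks like:

  gdbserver --no-reuse-socket :2345 ./a.out

while running gdbserver without the new flags keeps the existing
SO_REUSEADDR behaviour.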

Tested on AArch64 and X86_64 with default board, native gdbserver
and native extended gdbserver.

gdb/gdbserver/ChangeLog:

2019-02-12  Alan Hayward  <alan.hayward@arm.com>

	* remote-utils.c (remote_prepare): Check reuse flag.
	* server.c (gdbserver_usage): Add reuse help messages.
	(captured_main): Check for reuse flags.
	* server.h (struct client_state): Add reuse flag.

gdb/testsuite/ChangeLog:

2019-02-12  Alan Hayward  <alan.hayward@arm.com>

	* lib/gdbserver-support.exp (gdbserver_start): Start gdbserver
	without socket reuse.
---
 gdb/gdbserver/remote-utils.c            | 12 +++++++-----
 gdb/gdbserver/server.c                  | 10 ++++++++++
 gdb/gdbserver/server.h                  | 10 ++++++++++
 gdb/testsuite/lib/gdbserver-support.exp |  3 +++
 4 files changed, 30 insertions(+), 5 deletions(-)
  

Comments

Tom Tromey Feb. 13, 2019, 9:05 p.m. UTC | #1
>>>>> "Alan" == Alan Hayward <Alan.Hayward@arm.com> writes:

Alan> (Not sure if there is any additional documentation that will need
Alan>  updating alongside this patch, or any other additions that need
Alan>  to go alongside new command line flags).

There is some documentation on invoking gdbserver in the manual, so that
should be updated.

Alan> Forcing on gdb and gdbserver debug shows that for the failure cases
Alan> gdbserver is launched, connects to gdb and successfully sends and
Alan> receives packets.  Meanwhile on the gdb side, no packets are sent or
Alan> received and the connection simply times out.  This indicates that
Alan> the gdb from another running test has connected to this gdbserver.

Could you walk me through how this happens?  It seems strange to me that
there could be any kind of race.  What I don't understand is that if two
tests can run in parallel, both start gdbserver, and then one gdb can
connect to the wrong server -- then it seems to me that the bug must be
in the test suite itself?

Alan> When running with -j55 on HEAD

Jealous.

Alan> 2019-02-12  Alan Hayward  <alan.hayward@arm.com>
Alan> 	* remote-utils.c (remote_prepare): Check reuse flag.
Alan> 	* server.c (gdbserver_usage): Add reuse help messages.
Alan> 	(captured_main): Check for reuse flags
Alan> 	* server.h (struct client_state): Add reuse flag.

FWIW the contents of the patch all seem fine to me.  But, I'd like to
understand the problem better before approving it (and also there's the
documentation).

thanks,
Tom
  
Tom Tromey Feb. 14, 2019, 2:33 p.m. UTC | #2
Alan> When running with -j55 on HEAD

Tom> Jealous.

Also, I was wondering if this speeds things up.  When I last looked at
this (a long time ago now), parallelizing the test suite didn't help
past -j12 or -j15 or so, because there were a couple of very long tests
that were "tent poles" after this.  As machines get more CPUs, though,
perhaps we should be looking into splitting those up.

thanks,
Tom
  
Alan Hayward Feb. 14, 2019, 4:31 p.m. UTC | #3
> On 14 Feb 2019, at 14:33, Tom Tromey <tom@tromey.com> wrote:
> 
> Alan> When running with -j55 on HEAD
> 
> Tom> Jealous.

:) But, the disadvantage of large server boxes is building and running
everything remotely, requiring network access.  You gain something, you
lose something.


> Also, I was wondering if this speeds things up.  When I last looked at
> this (a long time ago now), parallelizing the test suite didn't help
> past -j12 or -j15 or so, because there were a couple of very long tests
> that were "tent poles" after this.  As machines get more CPUs, though,
> perhaps we should be looking into splitting those up.

Gave this a quick try on make check for gdb.

-j55:  1m42.106s
-j40:  1m42.595s
-j30:  1m44.216s
-j20:  1m56.600s
-j10:  3m16.514s
-j5:   6m0.417s

There’s clearly a tailing off going on above 20. Although, under two minutes
seems reasonable enough to me. I couldn't think of an obvious way to quickly
see what the longest running tests were. It could also be test suite overhead.
I suspect the results might be more extreme with the gdbserver boards.


The socket clash is still going to happen at -j10 too; it will just be less
frequent.


Alan.
  
Tom Tromey Feb. 14, 2019, 10:48 p.m. UTC | #4
>>>>> "Alan" == Alan Hayward <Alan.Hayward@arm.com> writes:

Alan> There’s clearly a tailing off going on above 20. Although, under two minutes
Alan> seems reasonable enough to me.

Yeah.

Alan> I couldn't think of an obvious way to quickly
Alan> see what the longest running tests were.

I think what I did is just "time runtest x/y.exp" for each test file in
the tree, then sorted the results to find the slow tests.

Tom
  
Alan Hayward Feb. 22, 2019, 5:14 p.m. UTC | #5
> On 13 Feb 2019, at 21:05, Tom Tromey <tom@tromey.com> wrote:
> 
>>>>>> "Alan" == Alan Hayward <Alan.Hayward@arm.com> writes:
> 
> Alan> (Not sure if there is any additional documentation that will need
> Alan>  updating alongside this patch, or any other additions that need
> Alan>  to go alongside new command line flags).
> 
> There is some documentation on invoking gdbserver in the manual, so that
> should be updated.

Noted. I’ll put that into the next version, or raise an additional patch.

> Alan> Forcing on gdb and gdbserver debug shows that for the failure cases
> Alan> gdbserver is launched, connects to gdb and successfully sends and
> Alan> receives packets.  Meanwhile on the gdb side, no packets are sent or
> Alan> received and the connection simply times out.  This indicates that
> Alan> the gdb from another running test has connected to this gdbserver.
> 
> Could you walk me through how this happens?  It seems strange to me that
> there could be any kind of race.  What I don't understand is that if two
> tests can run in parallel, both start gdbserver, and then one gdb can
> connect to the wrong server -- then it seems to me that the bug must be
> in the test suite itself?

Sorry, totally missed replying to this part.  I’ll gather some more evidence
and get back to you next week.

> Alan> When running with -j55 on HEAD
> 
> Jealous.
> 
> Alan> 2019-02-12  Alan Hayward  <alan.hayward@arm.com>
> Alan> 	* remote-utils.c (remote_prepare): Check reuse flag.
> Alan> 	* server.c (gdbserver_usage): Add reuse help messages.
> Alan> 	(captured_main): Check for reuse flags
> Alan> 	* server.h (struct client_state): Add reuse flag.
> 
> FWIW the contents of the patch all seem fine to me.  But, I'd like to
> understand the problem better before approving it (and also there's the
> documentation).
> 
> thanks,
> Tom
  

Patch

diff --git a/gdb/gdbserver/remote-utils.c b/gdb/gdbserver/remote-utils.c
index ad0228db99..b46c42da0b 100644
--- a/gdb/gdbserver/remote-utils.c
+++ b/gdb/gdbserver/remote-utils.c
@@ -233,7 +233,6 @@  remote_prepare (const char *name)
 #ifdef USE_WIN32API
   static int winsock_initialized;
 #endif
-  socklen_t tmp;
 
   remote_is_stdio = 0;
   if (strcmp (name, STDIO_CONNECTION_NAME) == 0)
@@ -297,10 +296,13 @@  remote_prepare (const char *name)
   if (iter == NULL)
     perror_with_name ("Can't open socket");
 
-  /* Allow rapid reuse of this port. */
-  tmp = 1;
-  setsockopt (listen_desc, SOL_SOCKET, SO_REUSEADDR, (char *) &tmp,
-	      sizeof (tmp));
+  /* Allow rapid reuse of this port.  */
+  if (cs.socket_reuse)
+    {
+      socklen_t tmp = 1;
+      setsockopt (listen_desc, SOL_SOCKET, SO_REUSEADDR, (char *) &tmp,
+		  sizeof (tmp));
+    }
 
   switch (iter->ai_family)
     {
diff --git a/gdb/gdbserver/server.c b/gdb/gdbserver/server.c
index e960c10d40..c6bcd26bee 100644
--- a/gdb/gdbserver/server.c
+++ b/gdb/gdbserver/server.c
@@ -3453,6 +3453,12 @@  gdbserver_usage (FILE *stream)
 	   "                        Exec PROG directly instead of using a shell.\n"
 	   "                        Disables argument globbing and variable substitution\n"
 	   "                        on UNIX-like systems.\n"
+	   "  --reuse-socket\n"
+	   "                        Set the connection socket to allow reuse of local\n"
+	   "                        addresses.  (default)\n"
+	   "  --no-reuse-socket\n"
+	   "                        Do not set the connection socket to allow reuse of\n"
+	   "                        local addresses.\n"
 	   "\n"
 	   "Debug options:\n"
 	   "\n"
@@ -3710,6 +3716,10 @@  captured_main (int argc, char *argv[])
 	startup_with_shell = false;
       else if (strcmp (*next_arg, "--once") == 0)
 	run_once = 1;
+      else if (strcmp (*next_arg, "--reuse-socket") == 0)
+	cs.socket_reuse = true;
+      else if (strcmp (*next_arg, "--no-reuse-socket") == 0)
+	cs.socket_reuse = false;
       else if (strcmp (*next_arg, "--selftest") == 0)
 	selftest = true;
       else if (startswith (*next_arg, "--selftest="))
diff --git a/gdb/gdbserver/server.h b/gdb/gdbserver/server.h
index 400001addd..0ac90a93ce 100644
--- a/gdb/gdbserver/server.h
+++ b/gdb/gdbserver/server.h
@@ -198,6 +198,16 @@  struct client_state
 
   int current_traceframe = -1;
 
+  /* Should the connection socket be set to allow the reuse of local
+     addresses via the socket option SO_REUSEADDR.  The behavior most users
+     want is for the socket to become available immediately after gdbserver
+     exits, so that the same port can be used when gdbserver is restarted.
+     When running large numbers of short lived gdbservers in quick succession
+     across many processors (such as during the testsuite), this can cause
+     race conditions in socket allocation.  When disabled, the socket has to
+     time out after gdbserver exits before the OS will allow the port to be
+     reused, so more socket addresses may be needed.  */
+  bool socket_reuse = true;
 };
 
 client_state &get_client_state ();
diff --git a/gdb/testsuite/lib/gdbserver-support.exp b/gdb/testsuite/lib/gdbserver-support.exp
index 05234c43bd..63f1a5343c 100644
--- a/gdb/testsuite/lib/gdbserver-support.exp
+++ b/gdb/testsuite/lib/gdbserver-support.exp
@@ -278,6 +278,9 @@  proc gdbserver_start { options arguments } {
 	# Fire off the debug agent.
 	set gdbserver_command "$gdbserver"
 
+	# Prevent the potential for port address clashes.
+	append gdbserver_command " --no-reuse-socket"
+
 	# If gdbserver_reconnect will be called $gdbserver_reconnect_p must be
 	# set to true already during gdbserver_start.
 	global gdbserver_reconnect_p