gai_cancel()

Message ID 1497627984.6717.32.camel@pbcl.net
State New, archived
Headers

Commit Message

Phil Blundell June 16, 2017, 3:46 p.m. UTC
  According to the Linux manpage for getaddrinfo_a(), the behaviour of
gai_cancel() is as follows:

"The gai_cancel() function cancels the request req.  If the request has
been  canceled  successfully,  the  error  status  of  the request will
be set to EAI_CANCELED and normal asynchronous notification will be
performed."

However, this doesn't seem to match the actual behaviour.  From my
tests and inspection of the code, if a request is successfully
cancelled then asynchronous notification is not performed: the request
is simply dropped from the queue.  This wouldn't in itself be
unreasonable because the application can tell from the return value
whether or not this happened and deal with it appropriately.  Initially
I assumed that the code as written was correct and the documentation
was simply missing a crucial "not".

But, it further turns out that if notification is not performed, and
hence __gai_notify is not called, the async_waitlist struct that was
allocated by getaddrinfo_a() is leaked.  I've attached a testcase that
demonstrates the leak.

Clearly, changing gai_cancel() to perform notification would make the
code match the documentation and would also fix the memory leak.  But
it doesn't seem impossible that there might be applications which are
now relying on the current behaviour and might break if this change
were made.  Since getaddrinfo_a() doesn't appear to be part of any
standard, and it also isn't documented in the glibc manual, I wasn't
able to find any other reference that describes what the behaviour is
supposed to be.

Opinions?

thanks

p.
  

Comments

Florian Weimer June 16, 2017, 3:52 p.m. UTC | #1
On 06/16/2017 05:46 PM, Phil Blundell wrote:
> Opinions?

Is this related to bug 20874?

  https://sourceware.org/bugzilla/show_bug.cgi?id=20874

Thanks,
Florian
  
Joseph Myers June 16, 2017, 3:57 p.m. UTC | #2
Note that test-driver.c should now be used instead of test-skeleton.c.
  
Rical Jasan June 17, 2017, 4:47 p.m. UTC | #3
On 06/16/2017 08:46 AM, Phil Blundell wrote:
> getaddrinfo_a() doesn't appear to be part of any
> standard, and it also isn't documented in the glibc manual

sockets.texi is pretty out-of-date (to be nice).  It's hard to try and
write networking code using the glibc manual because so much of what you
see in use isn't in there anywhere.  I started working on getaddrinfo,
sockaddr_storage, etc., a while back, and have a few patches laying
around somewhere I need to dust off, but anything substantial with the
chapter is likely in the 2.27-2.28 timeline for me, unless someone else
gets to it first.

It's definitely on the radar though.

Rical
  
Phil Blundell June 19, 2017, 11:04 a.m. UTC | #4
On Fri, 2017-06-16 at 17:52 +0200, Florian Weimer wrote:
> On 06/16/2017 05:46 PM, Phil Blundell wrote:
> > Opinions?
> 
> Is this related to bug 20874?
> 
>   https://sourceware.org/bugzilla/show_bug.cgi?id=20874
> 

No, I don't think it's related.  I poked at that bug a bit and I am
fairly sure that the problem there is specifically in gai_suspend(). 
Under conditions that I don't entirely understand yet, we seem to be
somehow returning from gai_suspend while its waitlist[] entry is still
linked into requestlist->waiting.  Since gai_suspend() allocated its
waitlist[] on the stack, this leads rapidly to disaster when gai_notify
subsequently tries to traverse the linked list.

p.
  
Florian Weimer June 19, 2017, 11:51 a.m. UTC | #5
On 06/19/2017 01:04 PM, Phil Blundell wrote:
> On Fri, 2017-06-16 at 17:52 +0200, Florian Weimer wrote:
>> On 06/16/2017 05:46 PM, Phil Blundell wrote:
>>> Opinions?
>>
>> Is this related to bug 20874?
>>
>>   https://sourceware.org/bugzilla/show_bug.cgi?id=20874
>>
> 
> No, I don't think it's related.  I poked at that bug a bit and I am
> fairly sure that the problem there is specifically in gai_suspend(). 
> Under conditions that I don't entirely understand yet, we seem to be
> somehow returning from gai_suspend while its waitlist[] entry is still
> linked into requestlist->waiting.  Since gai_suspend() allocated its
> waitlist[] on the stack, this leads rapidly to disaster when gai_notify
> subsequently tries to traverse the linked list.

Okay, thanks for investigating.

How does the memory leak happen?  Would another notification eventually
deallocate the struct async_waitlist object?

Thanks,
Florian
  
Phil Blundell June 19, 2017, 12:27 p.m. UTC | #6
On Mon, 2017-06-19 at 13:51 +0200, Florian Weimer wrote:
> How does the memory leak happen?  Would another notification
> eventually
> deallocate the struct async_waitlist object?

The async_waitlist is allocated in getaddrinfo_a and, under normal
circumstances, freed by this dubious-looking code in __gai_notify():

	    /* This is tricky.  See getaddrinfo_a.c for the reason why
	       this works.  */
	    free ((void *) waitlist->counterp);

If gai_cancel() removes the entry from the request queue then, as the
code stands today, nothing will cause __gai_notify() to be called for
it and hence the async_waitlist is never freed.  I don't think there is
any mechanism that will cause another notification to eventually
deallocate the memory.

But, I ran a testcase (basically identical to the tst-leaks3 that I
posted before) under valgrind and it considers those blocks to be
"possibly lost" rather than the "definitely lost" that I was expecting,
so perhaps there is some internal object that tracks them after all.  I
do find all the twisty little data structures involved in
getaddrinfo_a() particularly hard to keep straight in my head so it's
entirely possible I have overlooked something.

p.
  
Phil Blundell June 19, 2017, 2:36 p.m. UTC | #7
On Mon, 2017-06-19 at 13:27 +0100, Phil Blundell wrote:
> If gai_cancel() removes the entry from the request queue then, as the
> code stands today, nothing will cause __gai_notify() to be called for
> it and hence the async_waitlist is never freed. 

From inspection of the code I think there is another leak too:
__gai_remove_request() unlinks "runp" from the request list, but as far
as I can tell there is nothing that will ever put it back in the
freelist so that element is essentially lost.  

This leak is not immediately obvious from the mtrace or valgrind
output, because all the "struct requestlist" objects live in a single
block that was calloc()ed in get_elem() and the block itself is not
lost, but if you run a test that simply calls getaddrinfo_a() and
gai_cancel() in a loop then you can see that the process size keeps
growing without any obvious bound.  (On my machine it reached 4G of rss
within a couple of seconds at which point I killed it.)

p.
  
Joseph Myers June 19, 2017, 3:04 p.m. UTC | #8
On Sat, 17 Jun 2017, Rical Jasan wrote:

> On 06/16/2017 08:46 AM, Phil Blundell wrote:
> > getaddrinfo_a() doesn't appear to be part of any
> > standard, and it also isn't documented in the glibc manual
> 
> sockets.texi is pretty out-of-date (to be nice).  It's hard to try and
> write networking code using the glibc manual because so much of what you
> see in use isn't in there anywhere.  I started working on getaddrinfo,
> sockaddr_storage, etc., a while back, and have a few patches laying
> around somewhere I need to dust off, but anything substantial with the
> chapter is likely in the 2.27-2.28 timeline for me, unless someone else
> gets to it first.

Note the undocumentedness of getaddrinfo / getnameinfo is bug 17202 (I'm 
sure there's a lot more undocumented than there are open bugs for, 
however).
  

Patch

From 11ec70a0bcd3bef4595020ada88303d349884f22 Mon Sep 17 00:00:00 2001
From: Phil Blundell <pb@pbcl.net>
Date: Fri, 16 Jun 2017 16:36:52 +0100
Subject: [PATCH] resolv/tst-leaks3.c: New test

---
 resolv/Makefile     | 15 +++++++++++----
 resolv/tst-leaks3.c | 54 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 65 insertions(+), 4 deletions(-)
 create mode 100644 resolv/tst-leaks3.c

diff --git a/resolv/Makefile b/resolv/Makefile
index dc597ca097..22926a9e5b 100644
--- a/resolv/Makefile
+++ b/resolv/Makefile
@@ -31,9 +31,9 @@  routines := herror inet_addr inet_ntop inet_pton nsap_addr res_init \
 	    res_hconf res_libc res-state
 
 tests = tst-aton tst-leaks tst-inet_ntop
-xtests = tst-leaks2
+xtests = tst-leaks2 tst-leaks3
 
-generate := mtrace-tst-leaks.out tst-leaks.mtrace tst-leaks2.mtrace
+generate := mtrace-tst-leaks.out tst-leaks.mtrace tst-leaks2.mtrace tst-leaks3.mtrace
 
 extra-libs := libresolv libnss_dns
 ifeq ($(have-thread-library),yes)
@@ -95,7 +95,7 @@  endif
 ifeq ($(run-built-tests),yes)
 ifneq (no,$(PERL))
 tests-special += $(objpfx)mtrace-tst-leaks.out
-xtests-special += $(objpfx)mtrace-tst-leaks2.out
+xtests-special += $(objpfx)mtrace-tst-leaks2.out $(objpfx)mtrace-tst-leaks3.out
 endif
 endif
 
@@ -107,7 +107,8 @@  headers += rpc/netdb.h
 endif
 
 generated += mtrace-tst-leaks.out tst-leaks.mtrace \
-	     mtrace-tst-leaks2.out tst-leaks2.mtrace
+	     mtrace-tst-leaks2.out tst-leaks2.mtrace \
+	     mtrace-tst-leaks3.out tst-leaks3.mtrace
 
 include ../Rules
 
@@ -135,6 +136,12 @@  $(objpfx)mtrace-tst-leaks2.out: $(objpfx)tst-leaks2.out
 	$(common-objpfx)malloc/mtrace $(objpfx)tst-leaks2.mtrace > $@; \
 	$(evaluate-test)
 
+$(objpfx)tst-leaks3: $(objpfx)libresolv.so $(objpfx)libanl.so
+tst-leaks3-ENV = MALLOC_TRACE=$(objpfx)tst-leaks3.mtrace
+$(objpfx)mtrace-tst-leaks3.out: $(objpfx)tst-leaks3.out
+	$(common-objpfx)malloc/mtrace $(objpfx)tst-leaks3.mtrace > $@; \
+	$(evaluate-test)
+
 $(objpfx)tst-bug18665-tcp: $(objpfx)libresolv.so $(shared-thread-library)
 $(objpfx)tst-bug18665: $(objpfx)libresolv.so $(shared-thread-library)
 $(objpfx)tst-res_use_inet6: $(objpfx)libresolv.so $(shared-thread-library)
diff --git a/resolv/tst-leaks3.c b/resolv/tst-leaks3.c
new file mode 100644
index 0000000000..93f06e9e84
--- /dev/null
+++ b/resolv/tst-leaks3.c
@@ -0,0 +1,54 @@ 
+/* Tests for getaddrinfo_a
+   Copyright (C) 2017 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <mcheck.h>
+#include <netdb.h>
+#include <resolv.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+
+static int
+do_test (void)
+{
+  mtrace ();
+
+  for (int i = 0; i < 1000; ++i)
+    {
+      struct gaicb *req[1];
+      req[0] = malloc (sizeof (*req[0]));
+      memset (req[0], 0, sizeof (*req[0]));
+      req[0]->ar_name = "www.gnu.org";
+
+      int ret = getaddrinfo_a (GAI_NOWAIT, req, 1, NULL);
+      if (ret != 0)
+	{
+	  fprintf (stderr, "getaddrinfo_a() failed: %s\n",
+		   gai_strerror(ret));
+	  exit (EXIT_FAILURE);
+	}
+
+      gai_cancel (req[0]);
+    }
+  return 0;
+}
+
+#define TEST_FUNCTION do_test ()
+#define TIMEOUT 30
+/* This defines the `main' function and some more.  */
+#include <test-skeleton.c>
-- 
2.11.0