| Message ID | 20180810095750.13017-1-andrew.burgess@embecosm.com |
|---|---|
| State | New, archived |
Headers:
From: Andrew Burgess <andrew.burgess@embecosm.com>
To: gdb-patches@sourceware.org
Cc: Andrew Burgess <andrew.burgess@embecosm.com>
Subject: [PATCH] gdb: Fix instability in thread groups test
Date: Fri, 10 Aug 2018 10:57:50 +0100
Message-Id: <20180810095750.13017-1-andrew.burgess@embecosm.com>
Commit Message
Andrew Burgess
Aug. 10, 2018, 9:57 a.m. UTC
In the test script gdb.mi/list-thread-groups-available.exp we ask GDB
to list all thread groups, and match the output against a regexp.
Occasionally, I would see this test fail.

The expected output is a list of entries, each entry looking roughly
like this:

    {id="<DECIMAL>",type="process",description="<STRING>",
     user="<STRING>",cores=["<DECIMAL>","<DECIMAL>",...]}

All the fields after 'id' and 'type' are optional, and the 'cores'
list can contain one or more "<DECIMAL>" entries.

On my machine (running Fedora 27, kernel 4.17.3-100.fc27.x86_64) the
'description' is usually a non-empty string, and the 'cores' list has
at least one entry in it.  But sometimes, very rarely, I'll see an
entry in the process group list where the 'description' is an empty
string, the 'user' is the string "?", and the 'cores' list is empty.
Such an entry looks like this:

    {id="19863",type="process",description="",user="?",cores=[]}

I believe this is caused by the process exiting while GDB is scanning
/proc for process information.  The current code in
gdb/nat/linux-osdata.c is not (I think) resilient against exiting
processes.

This commit adjusts the regexp that matches the 'cores' list so that
an empty list is acceptable.  With this patch in place, the test
script gdb.mi/list-thread-groups-available.exp never fails for me.

gdb/testsuite/ChangeLog:

	* gdb.mi/list-thread-groups-available.exp: Update test regexp.
---
 gdb/testsuite/ChangeLog                               | 4 ++++
 gdb/testsuite/gdb.mi/list-thread-groups-available.exp | 2 +-
 2 files changed, 5 insertions(+), 1 deletion(-)
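The effect of the regexp change can be illustrated outside the testsuite.  The sketch below is a rough Python approximation (Python's `re` standing in for Tcl's `regexp`, with the `$decimal` variable expanded by hand); it is not the test's actual code:

```python
import re

decimal = r"[0-9]+"

# Old pattern: the list body is mandatory, so at least one core
# entry is required and cores=[] never matches.
old_cores_re = r'cores=\["{d}"(?:,"{d}")*\]'.format(d=decimal)

# New pattern: wrapping the body in (...)? makes it optional, so an
# empty cores=[] list is accepted as well.
new_cores_re = r'cores=\[(?:"{d}"(?:,"{d}")*)?\]'.format(d=decimal)

for entry in ['cores=["3","5"]', 'cores=[]']:
    print(entry,
          bool(re.fullmatch(old_cores_re, entry)),
          bool(re.fullmatch(new_cores_re, entry)))
# cores=["3","5"] matches both patterns; cores=[] matches only the new one.
```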
Comments
On 2018-08-10 05:57, Andrew Burgess wrote:
> In the test script gdb.mi/list-thread-groups-available.exp we ask GDB
> to list all thread groups, and match the output against a
> regexp.  Occasionally, I would see this test fail.
>
> [...]
>
> -set cores_re "cores=\\\[\"$decimal\"(,\"$decimal\")*\\\]"
> +set cores_re "cores=\\\[(\"$decimal\"(,\"$decimal\")*)?\\\]"

Hi Andrew,

The patch LGTM.  I manually reproduced this case by spawning a process
(tail -f /dev/null) and noting its pid.  In
linux_xfer_osdata_processes, I added:

    if (pid == <pid>)
      sleep (5);

and killed the process during that sleep.

Simon
On 08/10/2018 10:26 PM, Simon Marchi wrote:
> On 2018-08-10 05:57, Andrew Burgess wrote:
>> [...]
>
> The patch LGTM.  I manually reproduced this case by spawning a process
> (tail -f /dev/null) and noting its pid.  [...]

But shouldn't we make GDB handle this better?  Make the output more
"atomic", in the sense that we either show a valid, complete entry, or
no entry at all?  There's an inherent race here, since we use multiple
/proc accesses to fill in a process entry.  If we start fetching
process info for a process, and the process disappears midway, I'd
think it better to discard that process's entry, as if we had not even
seen it, i.e. as if we had listed the set of processes a tiny moment
later.

Thanks,
Pedro Alves
* Pedro Alves <palves@redhat.com> [2018-08-13 10:51:44 +0100]:
> On 08/10/2018 10:26 PM, Simon Marchi wrote:
> > [...]
>
> But shouldn't we make GDB handle this better?  Make the output more
> "atomic" in the sense that we either show a valid complete entry, or
> no entry?  There's an inherent race here, since we use multiple /proc
> accesses to fill up a process entry.  If we start fetching process
> info for a process, and the process disappears midway, I'd think it
> better to discard that process's entry, as if we had not even seen
> it, i.e. as if we had listed the set of processes a tiny moment
> later.

I agree.

We also need to think about process reuse.  With multiple accesses to
/proc we might start with one process, and end up with a completely
new process.

I might be overthinking it, but my first guess at a reliable strategy
would be:

1. Find each /proc/PID directory.
2. Read /proc/PID/stat and extract the start time.  Failure to read
   this causes the process to be abandoned.
3. Read all of the other /proc/PID/XXX files as needed.  Any failure
   results in the process being abandoned.
4. Reread /proc/PID/stat and confirm the start time hasn't changed; a
   change would indicate a new process having slipped in.

Given the system is still running, we can never be sure that we have
"all" processes, so throwing out anything that looks wrong seems like
the right strategy.

Also, in step #4 we know we've just missed a process - something new
has started, but we ignore it.  I think this is fine, though, given
the racy nature of this sort of thing...

The only question is: could these thoughts be dropped into a bug
report, and the original patch to remove the unstable result applied?
Or maybe the test updated to either PASS or KFAIL?

Thanks,
Andrew
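The start-time check in that scheme could be sketched as follows.  This is a hypothetical Python illustration, not GDB's actual C code in linux-osdata.c, and the helper names are invented:

```python
# Hypothetical sketch of steps 2 and 4 above.  The start time is
# field 22 of /proc/PID/stat; since the comm field (field 2) may
# itself contain spaces and parentheses, split on the *last* ')'.
def starttime_from_stat(stat_contents):
    # After the last ')' the remaining fields are space-separated,
    # starting with field 3 (state), so starttime is at index 19.
    fields = stat_contents.rsplit(')', 1)[1].split()
    return int(fields[19])

def read_starttime(pid):
    with open('/proc/%d/stat' % pid) as f:
        return starttime_from_stat(f.read())

# Synthetic stat line for a process whose comm is "tail -f":
line = ('1234 (tail -f) R 1 1 1 0 -1 4194304 100 0 0 0 '
        '5 3 0 0 20 0 1 0 987654')
print(starttime_from_stat(line))  # 987654
```

Under this scheme an entry would only be emitted if read_starttime(pid) succeeds both before and after gathering the other /proc/PID files and returns the same value each time; any read failure or mismatch abandons the entry.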
On 08/13/2018 12:41 PM, Andrew Burgess wrote:
> * Pedro Alves <palves@redhat.com> [2018-08-13 10:51:44 +0100]:
>
>> But shouldn't we make GDB handle this better?  Make the output
>> more "atomic" in the sense that we either show a valid complete
>> entry, or no entry?  [...]
>
> I agree.
>
> We also need to think about process reuse.  So with multiple accesses
> to /proc we might start with one process, and end up with a
> completely new process.
>
> I might be overthinking it, but my first guess at a reliable strategy
> would be:
>
> 1. Find each /proc/PID directory.
> 2. Read /proc/PID/stat and extract the start time.  Failure to read
>    this causes the process to be abandoned.
> 3. Read all of the other /proc/PID/XXX files as needed.  Any failure
>    results in the process being abandoned.
> 4. Reread /proc/PID/stat and confirm the start time hasn't changed,
>    this would indicate a new process having slipped in.

My initial quick thought was just to drop the process entry if it
turns out we end up with an empty core set.

I wonder whether we can prevent PID reuse by keeping a descriptor for
/proc/PID/ open while we open the other files.  Probably not.
Otherwise, your scheme sounds like the next best.

> Given the system is still running, we can never be sure that we have
> "all" processes, so throwing out anything that looks wrong seems like
> the right strategy.
>
> Also in step #4 we know we've just missed a process - something new
> has started, but we ignore it.  I think this is fine though given the
> racy nature of this sort of thing...
>
> The only question is, could these thoughts be dropped into a bug
> report,

Sure.

> and the original patch to remove the unstable result applied?
> Or maybe the test updated to either PASS or KFAIL?

I'd prefer the KFAIL option.  At the very least, a comment in the .exp
file.

Thanks,
Pedro Alves
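The simpler mitigation mentioned above, dropping any entry that ends up with an empty core set, could look roughly like this.  The dict-based entries are purely illustrative and do not reflect GDB's actual osdata structures:

```python
# Illustrative only: drop entries whose cores list came back empty,
# treating the process as if it had already exited when the scan ran.
def prune_racy_entries(entries):
    return [entry for entry in entries if entry['cores']]

entries = [
    {'id': '100', 'cores': ['0', '1']},
    {'id': '19863', 'cores': []},  # process exited mid-scan
]
print(prune_racy_entries(entries))  # [{'id': '100', 'cores': ['0', '1']}]
```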
* Pedro Alves <palves@redhat.com> [2018-08-13 13:03:47 +0100]:
> On 08/13/2018 12:41 PM, Andrew Burgess wrote:
> > [...]
>
> My initial quick thought was just to drop the process entry if it
> turns out we end up with an empty core set.
>
> I wonder whether we can prevent PID reuse by keeping a descriptor
> for /proc/PID/ open while we open the other files.  Probably not.

That was my first thought.  I tried:

- chdir to /proc/PID
- opendir /proc/PID

- kill the process

- read from the opendir handle, find nothing there

Which didn't really surprise me, but was worth a try...

> Otherwise, your scheme sounds like the next best.

> > The only question is, could these thoughts be dropped into a bug
> > report,
>
> Sure.
>
> > and the original patch to remove the unstable result applied?
> > Or maybe the test updated to either PASS or KFAIL?
>
> I'd prefer the KFAIL option.  At the very least, a comment in the
> .exp file.

I'll put something together...

Thanks,
Andrew
On 08/13/2018 02:01 PM, Andrew Burgess wrote:
> * Pedro Alves <palves@redhat.com> [2018-08-13 13:03:47 +0100]:
>
>> I wonder whether we can prevent PID reuse by keeping a descriptor
>> for /proc/PID/ open while we open the other files.  Probably not.
>
> That was my first thought.  I tried:
>
> - chdir to /proc/PID
> - opendir /proc/PID
>
> - kill the process
>
> - read from the opendir handle, find nothing there
>
> Which didn't really surprise me, but was worth a try...

Does it return "nothing there" even if you don't kill the process?  Or
does returning nothing indicate the process is gone already?

Regardless, I don't think that proves that keeping the opendir handle
open (or some other file under /proc/PID) does not prevent the kernel
from reusing the PID until the handle is closed, even though I do
suspect it does not.

But thinking a bit more, maybe it's useless to try to detect PID
reuse, because the process we're collecting info for can just as well
exec, which makes the info we had collected so far become invalid in
pretty much the same way...

>>> and the original patch to remove the unstable result applied?
>>> Or maybe the test updated to either PASS or KFAIL?
>>
>> I'd prefer the KFAIL option.  At the very least, a comment in the
>> .exp file.
>
> I'll put something together...

Maybe it's not worth the bother.  After thinking about it some more,
I'll be happy with a comment in the .exp file.

Pedro Alves
diff --git a/gdb/testsuite/gdb.mi/list-thread-groups-available.exp b/gdb/testsuite/gdb.mi/list-thread-groups-available.exp
index c4dab2a2c34..88f9ee9b63d 100644
--- a/gdb/testsuite/gdb.mi/list-thread-groups-available.exp
+++ b/gdb/testsuite/gdb.mi/list-thread-groups-available.exp
@@ -45,7 +45,7 @@ set id_re "id=\"$decimal\""
 set type_re "type=\"process\""
 set description_re "description=\"$string_re\""
 set user_re "user=\"$string_re\""
-set cores_re "cores=\\\[\"$decimal\"(,\"$decimal\")*\\\]"
+set cores_re "cores=\\\[(\"$decimal\"(,\"$decimal\")*)?\\\]"
 
 # List all available processes.
 set process_entry_re "{${id_re},${type_re}(,$description_re)?(,$user_re)?(,$cores_re)?}"