gdb: Fix instability in thread groups test

Message ID 20180810095750.13017-1-andrew.burgess@embecosm.com
State New, archived

Commit Message

Andrew Burgess Aug. 10, 2018, 9:57 a.m. UTC
  In the test script gdb.mi/list-thread-groups-available.exp we ask GDB
to list all thread groups, and match the output against a
regexp. Occasionally, I would see this test fail.

The expected output is a list of entries, each entry looking roughly
like this:

  {id="<DECIMAL>",type="process",description="<STRING>",
   user="<STRING>",cores=["<DECIMAL>","<DECIMAL>",...]}

All the fields after 'id' and 'type' are optional, and the 'cores'
list can contain 1 or more "<DECIMAL>" entries.

On my machine (Running Fedora 27, kernel 4.17.3-100.fc27.x86_64)
usually the 'description' is a non-empty string, and the 'cores' list
has at least one entry in it.  But sometimes, very rarely, I'll see an
entry in the process group list where the 'description' is an empty
string, the 'user' is the string "?", and the 'cores' list is empty.
Such an entry looks like this:

   {id="19863",type="process",description="",user="?",cores=[]}

I believe that this is caused by the process exiting while GDB is
scanning /proc for process information.  The current code in
gdb/nat/linux-osdata.c is not (I think) resilient against exiting
processes.

This commit adjusts the regex that matches the 'cores' list so that an
empty list is acceptable.  With this patch in place the test script
gdb.mi/list-thread-groups-available.exp never fails for me.

gdb/testsuite/ChangeLog:

	* gdb.mi/list-thread-groups-available.exp: Update test regexp.
---
 gdb/testsuite/ChangeLog                               | 4 ++++
 gdb/testsuite/gdb.mi/list-thread-groups-available.exp | 2 +-
 2 files changed, 5 insertions(+), 1 deletion(-)
  

Comments

Simon Marchi Aug. 10, 2018, 9:26 p.m. UTC | #1
On 2018-08-10 05:57, Andrew Burgess wrote:
> In the test script gdb.mi/list-thread-groups-available.exp we ask GDB
> to list all thread groups, and match the output against a
> regexp. Occasionally, I would see this test fail.
> 
> The expected output is a list of entries, each entry looking roughly
> like this:
> 
>   {id="<DECIMAL>",type="process",description="<STRING>",
>    user="<STRING>",cores=["<DECIMAL>","<DECIMAL>",...]}
> 
> All the fields after 'id' and 'type' are optional, and the 'cores'
> list can contain 1 or more "<DECIMAL>" entries.
> 
> On my machine (Running Fedora 27, kernel 4.17.3-100.fc27.x86_64)
> usually the 'description' is a non-empty string, and the 'cores' list
> has at least one entry in it.  But sometimes, very rarely, I'll see an
> entry in the process group list where the 'description' is an empty
> string, the 'user' is the string "?", and the 'cores' list is empty.
> Such an entry looks like this:
> 
>    {id="19863",type="process",description="",user="?",cores=[]}
> 
> I believe that this is caused by the process exiting while GDB is
> scanning /proc for process information.  The current code in
> gdb/nat/linux-osdata.c is not (I think) resilient against exiting
> processes.
> 
> This commit adjusts the regex that matches the 'cores' list so that an
> empty list is acceptable, with this patch in place the test script
> gdb.mi/list-thread-groups-available.exp never fails for me now.
> 
> gdb/testsuite/ChangeLog:
> 
> 	* gdb.mi/list-thread-groups-available.exp: Update test regexp.
> ---
>  gdb/testsuite/ChangeLog                               | 4 ++++
>  gdb/testsuite/gdb.mi/list-thread-groups-available.exp | 2 +-
>  2 files changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/gdb/testsuite/gdb.mi/list-thread-groups-available.exp
> b/gdb/testsuite/gdb.mi/list-thread-groups-available.exp
> index c4dab2a2c34..88f9ee9b63d 100644
> --- a/gdb/testsuite/gdb.mi/list-thread-groups-available.exp
> +++ b/gdb/testsuite/gdb.mi/list-thread-groups-available.exp
> @@ -45,7 +45,7 @@ set id_re "id=\"$decimal\""
>  set type_re "type=\"process\""
>  set description_re "description=\"$string_re\""
>  set user_re "user=\"$string_re\""
> -set cores_re "cores=\\\[\"$decimal\"(,\"$decimal\")*\\\]"
> +set cores_re "cores=\\\[(\"$decimal\"(,\"$decimal\")*)?\\\]"
> 
>  # List all available processes.
>  set process_entry_re
> "{${id_re},${type_re}(,$description_re)?(,$user_re)?(,$cores_re)?}"

Hi Andrew,

The patch LGTM.  I manually reproduced this case by spawning a process 
(tail -f /dev/null) and noting its pid.  In linux_xfer_osdata_processes, 
I added:

   if (pid == <pid>)
     sleep (5);

then killed the process during that sleep.
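
For reference, here is a minimal standalone sketch of the same race
outside of GDB (not the linux-osdata.c code; TARGET_PID is a
placeholder for the pid of the spawned process):

  /* List /proc and read each process's cmdline, with an artificial
     delay for one pid so that the process can be killed in between.  */
  #include <dirent.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>

  #define TARGET_PID 12345	/* Placeholder pid.  */

  int
  main (void)
  {
    DIR *proc = opendir ("/proc");
    if (proc == NULL)
      return 1;

    struct dirent *entry;
    while ((entry = readdir (proc)) != NULL)
      {
        char *end;
        long pid = strtol (entry->d_name, &end, 10);
        if (*end != '\0')
          continue;		/* Not a /proc/PID directory.  */

        if (pid == TARGET_PID)
          sleep (5);		/* Kill the target process now.  */

        char path[64], cmdline[256] = "";
        snprintf (path, sizeof (path), "/proc/%ld/cmdline", pid);
        FILE *f = fopen (path, "r");
        if (f != NULL)
          {
            fread (cmdline, 1, sizeof (cmdline) - 1, f);
            fclose (f);
          }

        /* For the killed pid the cmdline is gone, which is what shows
           up as description=""/cores=[] in the MI output.  */
        printf ("%ld: %s\n", pid,
                cmdline[0] != '\0' ? cmdline : "<unavailable>");
      }

    closedir (proc);
    return 0;
  }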

Simon
  
Pedro Alves Aug. 13, 2018, 9:51 a.m. UTC | #2
On 08/10/2018 10:26 PM, Simon Marchi wrote:
> On 2018-08-10 05:57, Andrew Burgess wrote:
>> In the test script gdb.mi/list-thread-groups-available.exp we ask GDB
>> to list all thread groups, and match the output against a
>> regexp. Occasionally, I would see this test fail.
>>
>> The expected output is a list of entries, each entry looking roughly
>> like this:
>>
>>   {id="<DECIMAL>",type="process",description="<STRING>",
>>    user="<STRING>",cores=["<DECIMAL>","<DECIMAL>",...]}
>>
>> All the fields after 'id' and 'type' are optional, and the 'cores'
>> list can contain 1 or more "<DECIMAL>" entries.
>>
>> On my machine (Running Fedora 27, kernel 4.17.3-100.fc27.x86_64)
>> usually the 'description' is a non-empty string, and the 'cores' list
>> has at least one entry in it.  But sometimes, very rarely, I'll see an
>> entry in the process group list where the 'description' is an empty
>> string, the 'user' is the string "?", and the 'cores' list is empty.
>> Such an entry looks like this:
>>
>>    {id="19863",type="process",description="",user="?",cores=[]}
>>
>> I believe that this is caused by the process exiting while GDB is
>> scanning /proc for process information.  The current code in
>> gdb/nat/linux-osdata.c is not (I think) resilient against exiting
>> processes.
>>
>> This commit adjusts the regex that matches the 'cores' list so that an
>> empty list is acceptable, with this patch in place the test script
>> gdb.mi/list-thread-groups-available.exp never fails for me now.
>>
>> gdb/testsuite/ChangeLog:
>>
>>     * gdb.mi/list-thread-groups-available.exp: Update test regexp.
>> ---
>>  gdb/testsuite/ChangeLog                               | 4 ++++
>>  gdb/testsuite/gdb.mi/list-thread-groups-available.exp | 2 +-
>>  2 files changed, 5 insertions(+), 1 deletion(-)
>>
>> diff --git a/gdb/testsuite/gdb.mi/list-thread-groups-available.exp
>> b/gdb/testsuite/gdb.mi/list-thread-groups-available.exp
>> index c4dab2a2c34..88f9ee9b63d 100644
>> --- a/gdb/testsuite/gdb.mi/list-thread-groups-available.exp
>> +++ b/gdb/testsuite/gdb.mi/list-thread-groups-available.exp
>> @@ -45,7 +45,7 @@ set id_re "id=\"$decimal\""
>>  set type_re "type=\"process\""
>>  set description_re "description=\"$string_re\""
>>  set user_re "user=\"$string_re\""
>> -set cores_re "cores=\\\[\"$decimal\"(,\"$decimal\")*\\\]"
>> +set cores_re "cores=\\\[(\"$decimal\"(,\"$decimal\")*)?\\\]"
>>
>>  # List all available processes.
>>  set process_entry_re
>> "{${id_re},${type_re}(,$description_re)?(,$user_re)?(,$cores_re)?}"
> 
> Hi Andrew,
> 
> The patch LGTM.  I manually reproduced this case by spawning a process (tail -f /dev/null) and noting its pid.  In linux_xfer_osdata_processes, I added:
> 
>   if (pid == <pid>)
>     sleep (5);
> 
> and killing the process during that sleep.

But shouldn't we make GDB handle this better?  Make the output
more "atomic" in the sense that we either show a valid complete
entry, or no entry?  There's an inherent race
here, since we use multiple /proc accesses to fill up a process
entry.  If we start fetching process info for a process, and the process
disappears midway, I'd think it better to discard that process's entry,
as-if we had not even seen it, i.e., as if we had listed the set of
processes a tiny moment later.

Thanks,
Pedro Alves
  
Andrew Burgess Aug. 13, 2018, 11:41 a.m. UTC | #3
* Pedro Alves <palves@redhat.com> [2018-08-13 10:51:44 +0100]:

> On 08/10/2018 10:26 PM, Simon Marchi wrote:
> > On 2018-08-10 05:57, Andrew Burgess wrote:
> >> In the test script gdb.mi/list-thread-groups-available.exp we ask GDB
> >> to list all thread groups, and match the output against a
> >> regexp. Occasionally, I would see this test fail.
> >>
> >> The expected output is a list of entries, each entry looking roughly
> >> like this:
> >>
> >>   {id="<DECIMAL>",type="process",description="<STRING>",
> >>    user="<STRING>",cores=["<DECIMAL>","<DECIMAL>",...]}
> >>
> >> All the fields after 'id' and 'type' are optional, and the 'cores'
> >> list can contain 1 or more "<DECIMAL>" entries.
> >>
> >> On my machine (Running Fedora 27, kernel 4.17.3-100.fc27.x86_64)
> >> usually the 'description' is a non-empty string, and the 'cores' list
> >> has at least one entry in it.  But sometimes, very rarely, I'll see an
> >> entry in the process group list where the 'description' is an empty
> >> string, the 'user' is the string "?", and the 'cores' list is empty.
> >> Such an entry looks like this:
> >>
> >>    {id="19863",type="process",description="",user="?",cores=[]}
> >>
> >> I believe that this is caused by the process exiting while GDB is
> >> scanning /proc for process information.  The current code in
> >> gdb/nat/linux-osdata.c is not (I think) resilient against exiting
> >> processes.
> >>
> >> This commit adjusts the regex that matches the 'cores' list so that an
> >> empty list is acceptable, with this patch in place the test script
> >> gdb.mi/list-thread-groups-available.exp never fails for me now.
> >>
> >> gdb/testsuite/ChangeLog:
> >>
> >>     * gdb.mi/list-thread-groups-available.exp: Update test regexp.
> >> ---
> >>  gdb/testsuite/ChangeLog                               | 4 ++++
> >>  gdb/testsuite/gdb.mi/list-thread-groups-available.exp | 2 +-
> >>  2 files changed, 5 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/gdb/testsuite/gdb.mi/list-thread-groups-available.exp
> >> b/gdb/testsuite/gdb.mi/list-thread-groups-available.exp
> >> index c4dab2a2c34..88f9ee9b63d 100644
> >> --- a/gdb/testsuite/gdb.mi/list-thread-groups-available.exp
> >> +++ b/gdb/testsuite/gdb.mi/list-thread-groups-available.exp
> >> @@ -45,7 +45,7 @@ set id_re "id=\"$decimal\""
> >>  set type_re "type=\"process\""
> >>  set description_re "description=\"$string_re\""
> >>  set user_re "user=\"$string_re\""
> >> -set cores_re "cores=\\\[\"$decimal\"(,\"$decimal\")*\\\]"
> >> +set cores_re "cores=\\\[(\"$decimal\"(,\"$decimal\")*)?\\\]"
> >>
> >>  # List all available processes.
> >>  set process_entry_re
> >> "{${id_re},${type_re}(,$description_re)?(,$user_re)?(,$cores_re)?}"
> > 
> > Hi Andrew,
> > 
> > The patch LGTM.  I manually reproduced this case by spawning a process (tail -f /dev/null) and noting its pid.  In linux_xfer_osdata_processes, I added:
> > 
> >   if (pid == <pid>)
> >     sleep (5);
> > 
> > and killing the process during that sleep.
> 
> But shouldn't we make GDB handle this better?  Make the output
> more "atomic" in the sense that we either show a valid complete
> entry, or no entry?  There's an inherent race
> here, since we use multiple /proc accesses to fill up a process
> entry.  If we start fetching process info for a process, and the process
> disappears midway, I'd think it better to discard that process's entry,
> as-if we had not even seen it, i.e., as if we had listed the set of
> processes a tiny moment later.

I agree.

We also need to think about process reuse.  So with multiple accesses
to /proc we might start with one process, and end up with a completely
new process.

I might be overthinking it, but my first guess at a reliable strategy
would be:

  1. Find each /proc/PID directory.
  2. Read /proc/PID/stat and extract the start time.  Failure to read
     this causes the process to be abandoned.
  3. Read all of the other /proc/PID/XXX files as needed.  Any failure
     results in the process being abandoned.
  4. Reread /proc/PID/stat and confirm the start time hasn't changed;
     a change would indicate a new process having slipped in.

Given the system is still running, we can never be sure that we have
"all" processes, so throwing out anything that looks wrong seems like
the right strategy.

Also in step #4 we know we've just missed a process - something new
has started, but we ignore it.  I think this is fine though given the
racy nature of this sort of thing...
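
To make the idea concrete, here's a rough sketch of what steps 2 and 4
could look like (this is not the current linux-osdata.c code, and
read_start_time is a hypothetical helper).  The start time is field 22
of /proc/PID/stat, in clock ticks since boot, and changes if the pid is
reused by a new process:

  #include <stdbool.h>
  #include <stdio.h>
  #include <string.h>

  /* Read PID's start time from /proc/PID/stat.  Return false if the
     process is gone or the file can't be parsed.  */
  static bool
  read_start_time (int pid, unsigned long long *start_time)
  {
    char path[64], buf[1024];
    snprintf (path, sizeof (path), "/proc/%d/stat", pid);

    FILE *f = fopen (path, "r");
    if (f == NULL)
      return false;
    size_t len = fread (buf, 1, sizeof (buf) - 1, f);
    fclose (f);
    if (len == 0)
      return false;
    buf[len] = '\0';

    /* The comm field may contain spaces, so parse from the last ')'.  */
    char *p = strrchr (buf, ')');
    if (p == NULL)
      return false;

    /* Skip the state field and fields 4..21, then read starttime (22).  */
    return sscanf (p + 2,
		   "%*c %*s %*s %*s %*s %*s %*s %*s %*s %*s"
		   " %*s %*s %*s %*s %*s %*s %*s %*s %llu",
		   start_time) == 1;
  }

  /* Per-process loop, roughly:

       unsigned long long before, after;
       if (!read_start_time (pid, &before))
	 continue;	// Step 2: process gone, abandon it.
       ... read the other /proc/PID/ files, abandoning on any failure ...
       if (!read_start_time (pid, &after) || after != before)
	 continue;	// Step 4: exited or pid reused, discard the entry.
       ... emit the completed entry ...  */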

The only question is, could these thoughts be dropped into a bug
report, and the original patch to remove the unstable result applied?
Or maybe the test updated to either PASS or KFAIL?

Thanks,
Andrew
  
Pedro Alves Aug. 13, 2018, 12:03 p.m. UTC | #4
On 08/13/2018 12:41 PM, Andrew Burgess wrote:
> * Pedro Alves <palves@redhat.com> [2018-08-13 10:51:44 +0100]:
> 
>> But shouldn't we make GDB handle this better?  Make the output
>> more "atomic" in the sense that we either show a valid complete
>> entry, or no entry?  There's an inherent race
>> here, since we use multiple /proc accesses to fill up a process
>> entry.  If we start fetching process info for a process, and the process
>> disappears midway, I'd think it better to discard that process's entry,
>> as-if we had not even seen it, i.e., as if we had listed the set of
>> processes a tiny moment later.
> 
> I agree.
> 
> We also need to think about process reuse.  So with multiple accesses
> to /proc we might start with one process, and end up with a completely
> new process.
> 
> I might be overthinking it, but my first guess at a reliable strategy
> would be:
> 
>   1. Find each /proc/PID directory.
>   2. Read /proc/PID/stat and extract the start time.  Failure to read
>      this causes the process to be abandoned.
>   3. Read all of the other /proc/PID/XXX files as needed.  Any failure
>      results in the process being abandoned.
>   4. Reread /proc/PID/stat and confirm the start time hasn't changed,
>      this would indicate a new process having slipped in.
> 

My initial quick thought was just to drop the process entry if
it turns out we end up with an empty core set.  

I wonder whether we can prevent PID reuse by keeping a descriptor
for /proc/PID/ open while we open the other files.  Probably not.
Otherwise, your scheme sounds like the next best.

> Given the system is still running, we can never be sure that we have
> "all" processes, so throwing out anything that looks wrong seems like
> the right strategy.
> 
> Also in step #4 we know we've just missed a process - something new
> has started, but we ignore it.  I think this is fine though given the
> racy nature of this sort of thing...
> 
> The only question is, could these thoughts be dropped into a bug
> report, 


Sure.


> and the original patch to remove the unstable result applied?
> Or maybe the test updated to either PASS or KFAIL?

I'd prefer the KFAIL option.  At the very least, a comment in
the .exp file.

Thanks,
Pedro Alves
  
Andrew Burgess Aug. 13, 2018, 1:01 p.m. UTC | #5
* Pedro Alves <palves@redhat.com> [2018-08-13 13:03:47 +0100]:

> On 08/13/2018 12:41 PM, Andrew Burgess wrote:
> > * Pedro Alves <palves@redhat.com> [2018-08-13 10:51:44 +0100]:
> > 
> >> But shouldn't we make GDB handle this better?  Make the output
> >> more "atomic" in the sense that we either show a valid complete
> >> entry, or no entry?  There's an inherent race
> >> here, since we use multiple /proc accesses to fill up a process
> >> entry.  If we start fetching process info for a process, and the process
> >> disappears midway, I'd think it better to discard that process's entry,
> >> as-if we had not even seen it, i.e., as if we had listed the set of
> >> processes a tiny moment later.
> > 
> > I agree.
> > 
> > We also need to think about process reuse.  So with multiple accesses
> > to /proc we might start with one process, and end up with a completely
> > new process.
> > 
> > I might be overthinking it, but my first guess at a reliable strategy
> > would be:
> > 
> >   1. Find each /proc/PID directory.
> >   2. Read /proc/PID/stat and extract the start time.  Failure to read
> >      this causes the process to be abandoned.
> >   3. Read all of the other /proc/PID/XXX files as needed.  Any failure
> >      results in the process being abandoned.
> >   4. Reread /proc/PID/stat and confirm the start time hasn't changed,
> >      this would indicate a new process having slipped in.
> > 
> 
> My initial quick thought was just to drop the process entry if
> it turns out we end up with an empty core set.  
> 
> I wonder whether we can prevent PID reuse by keeping a descriptor
> for /proc/PID/ open while we open the other files.  Probably not.

That was my first thought; I tried:

  - chdir /proc/PID
  - opendir for /proc/PID

  - Kill /proc/PID

  - Read from the opendir handle, find nothing there.

Which didn't really surprise me, but was worth a try...
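
Roughly, the experiment looked like this (a standalone reconstruction,
not the exact test code; the fork/pause child just stands in for the
process being killed):

  #include <dirent.h>
  #include <signal.h>
  #include <stdio.h>
  #include <sys/types.h>
  #include <sys/wait.h>
  #include <unistd.h>

  int
  main (void)
  {
    pid_t pid = fork ();
    if (pid == 0)
      {
	pause ();	/* Child: just wait to be killed.  */
	_exit (0);
      }

    char path[64];
    snprintf (path, sizeof (path), "/proc/%d", (int) pid);
    DIR *dir = opendir (path);	/* Opened while the child is alive.  */
    if (dir == NULL)
      return 1;

    kill (pid, SIGKILL);
    waitpid (pid, NULL, 0);	/* Child is now gone and reaped.  */

    /* Read from the handle we kept open: this finds no entries, so
       holding the handle doesn't appear to keep the /proc entry (or
       the pid) alive.  */
    int n = 0;
    while (readdir (dir) != NULL)
      n++;
    printf ("entries visible after kill: %d\n", n);

    closedir (dir);
    return 0;
  }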

> Otherwise, your scheme sounds like the next best.
> 
> > Given the system is still running, we can never be sure that we have
> > "all" processes, so throwing out anything that looks wrong seems like
> > the right strategy.
> > 
> > Also in step #4 we know we've just missed a process - something new
> > has started, but we ignore it.  I think this is fine though given the
> > racy nature of this sort of thing...
> > 
> > The only question is, could these thoughts be dropped into a bug
> > report, 
> 
> 
> Sure.
> 
> 
> > and the original patch to remove the unstable result applied?
> > Or maybe the test updated to either PASS or KFAIL?
> 
> I'd prefer the KFAIL option.  At the very least, a comment in
> the .exp file.

I'll put something together...

Thanks,
Andrew
  
Pedro Alves Aug. 13, 2018, 1:38 p.m. UTC | #6
On 08/13/2018 02:01 PM, Andrew Burgess wrote:
> * Pedro Alves <palves@redhat.com> [2018-08-13 13:03:47 +0100]:

>> I wonder whether we can prevent PID reuse by keeping a descriptor
>> for /proc/PID/ open while we open the other files.  Probably not.
> 
> That was my first though, I tried:
> 
>   - chdir /proc/PID
>   - opendir for /proc/PID
> 
>   - Kill /proc/PID
> 
>   - Read from the opendir handle, find nothing there.
> 
> Which didn't really surprise me, but was worth a try...

Does it return "nothing else" even if you don't kill
the process?  Or does returning nothing indicate the
process is gone already?

Regardless, I don't think that proves that keeping the opendir handle
open (or some other file under /proc/PID) does not prevent the kernel
from reusing the PID until the handle is closed, even though I
do suspect it does not.

But thinking a bit more, maybe it's useless to try to detect PID reuse,
because the process we're collecting info for can just as well exec,
which makes the info we had collected so far become invalid in pretty
much the same way...

>>> and the original patch to remove the unstable result applied?
>>> Or maybe the test updated to either PASS or KFAIL?
>>
>> I'd prefer the KFAIL option.  At the very least, a comment in
>> the .exp file.
> 
> I'll put something together...

Maybe it's not worth the bother.  After thinking about it some more,
I'll be happy with a comment in the .exp file.

Pedro Alves
  

Patch

diff --git a/gdb/testsuite/gdb.mi/list-thread-groups-available.exp b/gdb/testsuite/gdb.mi/list-thread-groups-available.exp
index c4dab2a2c34..88f9ee9b63d 100644
--- a/gdb/testsuite/gdb.mi/list-thread-groups-available.exp
+++ b/gdb/testsuite/gdb.mi/list-thread-groups-available.exp
@@ -45,7 +45,7 @@  set id_re "id=\"$decimal\""
 set type_re "type=\"process\""
 set description_re "description=\"$string_re\""
 set user_re "user=\"$string_re\""
-set cores_re "cores=\\\[\"$decimal\"(,\"$decimal\")*\\\]"
+set cores_re "cores=\\\[(\"$decimal\"(,\"$decimal\")*)?\\\]"
 
 # List all available processes.
 set process_entry_re "{${id_re},${type_re}(,$description_re)?(,$user_re)?(,$cores_re)?}"