Workaround a FreeBSD ptrace() bug with clearing thread events.

Message ID 20180224000935.43344-1-jhb@FreeBSD.org
State New, archived
Headers

Commit Message

John Baldwin Feb. 24, 2018, 12:09 a.m. UTC
  When multiple threads within a process wish to report STOPPED events
from wait(), the kernel picks one thread event as the thread event to
report.  The chosen thread event is retrieved via PT_LWPINFO by
passing the process ID as the request pid.  If multiple events are
pending, then the subsequent wait() after resuming a process will
report another STOPPED event after resuming the process to handle the
next thread event and so on.

A single thread event is cleared as a side effect of resuming the
process with PT_CONTINUE, PT_STEP, etc.  In older kernels, however,
the request pid was used to select which thread's event was cleared
rather than always clearing the event that was just reported.  To
avoid clearing the event of the wrong LWP, always pass the process ID
instead of an LWP ID to PT_CONTINUE or PT_SYSCALL.

In the case of stepping, the process ID cannot be used with PT_STEP
since it would step the thread that reported an event which may not be
the thread indicated by PTID.  For stepping, use PT_SETSTEP to enable
stepping on the desired thread before resuming the process via
PT_CONTINUE instead of using PT_STEP.

This manifested as a failure in the
gdb.threads/continue-pending-status.exp test.  Specifically, if thread
2 reported a breakpoint and the test thus switched to thread 3 before
continuing, thread 3's event (if any) was discarded and thread 2's
breakpoint remained pending and was reported a second time as a
duplicate event.  As a result, the PC was decremented twice for the
same breakpoint resulting in an illegal instruction fault on x86.

gdb/ChangeLog:

	* fbsd-nat.c (fbsd_resume): Use PT_SETSTEP for stepping and a
	wildcard process pid for super_resume for kernels with a
	specific bug.
---
 gdb/ChangeLog  |  6 ++++++
 gdb/fbsd-nat.c | 33 +++++++++++++++++++++++++++++++++
 2 files changed, 39 insertions(+)
  

Comments

Joel Brobecker Feb. 26, 2018, 4:27 a.m. UTC | #1
Hi John,

> gdb/ChangeLog:
> 
> 	* fbsd-nat.c (fbsd_resume): Use PT_SETSTEP for stepping and a
> 	wildcard process pid for super_resume for kernels with a
> 	specific bug.

Just a small coding style nit:

> +#if __FreeBSD_version < 1200052
> +  /*
> +   * When multiple threads within a process wish to report STOPPED
> +   * events from wait(), the kernel picks one thread event as the

In the GDB project, we do not use the '*' at the start of every line.
See:

https://sourceware.org/gdb/wiki/Internals%20GDB-C-Coding-Standards#Block_Comment_Formatting
  
John Baldwin Feb. 26, 2018, 5:25 p.m. UTC | #2
On Monday, February 26, 2018 08:27:54 AM Joel Brobecker wrote:
> Hi John,
> 
> > gdb/ChangeLog:
> > 
> > 	* fbsd-nat.c (fbsd_resume): Use PT_SETSTEP for stepping and a
> > 	wildcard process pid for super_resume for kernels with a
> > 	specific bug.
> 
> Just a small coding style nit:
> 
> > +#if __FreeBSD_version < 1200052
> > +  /*
> > +   * When multiple threads within a process wish to report STOPPED
> > +   * events from wait(), the kernel picks one thread event as the
> 
> In the GDB project, we do not use the '*' at the start of every line.
> See:
> 
> https://sourceware.org/gdb/wiki/Internals%20GDB-C-Coding-Standards#Block_Comment_Formatting

Oops, yes, I'll fix.
  
Maciej W. Rozycki March 2, 2018, 12:13 a.m. UTC | #3
On Fri, 23 Feb 2018, John Baldwin wrote:

> diff --git a/gdb/fbsd-nat.c b/gdb/fbsd-nat.c
> index d44950618c..9c87bfed33 100644
> --- a/gdb/fbsd-nat.c
> +++ b/gdb/fbsd-nat.c
> @@ -1163,6 +1163,39 @@ fbsd_resume (struct target_ops *ops,
>  	}
>        ptid = inferior_ptid;
>      }
> +
> +#if __FreeBSD_version < 1200052
> +  /*
> +   * When multiple threads within a process wish to report STOPPED
> +   * events from wait(), the kernel picks one thread event as the
> +   * thread event to report.  The chosen thread event is retrieved via
> +   * PT_LWPINFO by passing the process ID as the request pid.  If
> +   * multiple events are pending, then the subsequent wait() after
> +   * resuming a process will report another STOPPED event after
> +   * resuming the process to handle the next thread event and so on.
> +   *
> +   * A single thread event is cleared as a side effect of resuming the
> +   * process with PT_CONTINUE, PT_STEP, etc.  In older kernels,
> +   * however, the request pid was used to select which thread's event
> +   * was cleared rather than always clearing the event that was just
> +   * reported.  To avoid clearing the event of the wrong LWP, always
> +   * pass the process ID instead of an LWP ID to PT_CONTINUE or
> +   * PT_SYSCALL.

 Hmm, doesn't it have to be a run-time check then?  Otherwise you're 
basing your decision on the host system GDB has been built for and not one 
it will be run on, which I suppose does not necessarily have to be of the 
same version.  Or am I missing anything here?

  Maciej
  
John Baldwin March 2, 2018, 6:28 p.m. UTC | #4
On Friday, March 02, 2018 12:13:20 AM Maciej W. Rozycki wrote:
> On Fri, 23 Feb 2018, John Baldwin wrote:
> 
> > diff --git a/gdb/fbsd-nat.c b/gdb/fbsd-nat.c
> > index d44950618c..9c87bfed33 100644
> > --- a/gdb/fbsd-nat.c
> > +++ b/gdb/fbsd-nat.c
> > @@ -1163,6 +1163,39 @@ fbsd_resume (struct target_ops *ops,
> >  	}
> >        ptid = inferior_ptid;
> >      }
> > +
> > +#if __FreeBSD_version < 1200052
> > +  /*
> > +   * When multiple threads within a process wish to report STOPPED
> > +   * events from wait(), the kernel picks one thread event as the
> > +   * thread event to report.  The chosen thread event is retrieved via
> > +   * PT_LWPINFO by passing the process ID as the request pid.  If
> > +   * multiple events are pending, then the subsequent wait() after
> > +   * resuming a process will report another STOPPED event after
> > +   * resuming the process to handle the next thread event and so on.
> > +   *
> > +   * A single thread event is cleared as a side effect of resuming the
> > +   * process with PT_CONTINUE, PT_STEP, etc.  In older kernels,
> > +   * however, the request pid was used to select which thread's event
> > +   * was cleared rather than always clearing the event that was just
> > +   * reported.  To avoid clearing the event of the wrong LWP, always
> > +   * pass the process ID instead of an LWP ID to PT_CONTINUE or
> > +   * PT_SYSCALL.
> 
>  Hmm, doesn't it have to be a run-time check then?  Otherwise you're 
> basing your decision on the host system GDB has been built for and not one 
> it will be run on, which I suppose does not necessarily have to be of the 
> same version.  Or am I missing anything here?

FreeBSD generally does not support forwards-compatability for binaries (newer
binary on older kernel), only backwards-compatability (older binary on newer
kernel).  In this case, using the workaround is also fine on a fixed kernel,
so it doesn't hurt if GDB is compiled on an older system (thus using the
workaround) and then run under a newer kernel.
  
Maciej W. Rozycki March 2, 2018, 7:31 p.m. UTC | #5
On Fri, 2 Mar 2018, John Baldwin wrote:

> >  Hmm, doesn't it have to be a run-time check then?  Otherwise you're 
> > basing your decision on the host system GDB has been built for and not one 
> > it will be run on, which I suppose does not necessarily have to be of the 
> > same version.  Or am I missing anything here?
> 
> FreeBSD generally does not support forwards-compatability for binaries (newer
> binary on older kernel), only backwards-compatability (older binary on newer
> kernel).  In this case, using the workaround is also fine on a fixed kernel,
> so it doesn't hurt if GDB is compiled on an older system (thus using the
> workaround) and then run under a newer kernel.

 Fair enough.

 Is the one-way compatibility enforced though, by a system library runtime 
or the kernel somehow, by refusing to run a binary built for a kernel that 
is newer than one currently in charge of the system?  Otherwise the rule 
would be quite fragile and error prone, asking for extra care to be taken 
by the user.

  Maciej
  
John Baldwin March 2, 2018, 10:34 p.m. UTC | #6
On Friday, March 02, 2018 07:31:49 PM Maciej W. Rozycki wrote:
> On Fri, 2 Mar 2018, John Baldwin wrote:
> 
> > >  Hmm, doesn't it have to be a run-time check then?  Otherwise you're 
> > > basing your decision on the host system GDB has been built for and not one 
> > > it will be run on, which I suppose does not necessarily have to be of the 
> > > same version.  Or am I missing anything here?
> > 
> > FreeBSD generally does not support forwards-compatability for binaries (newer
> > binary on older kernel), only backwards-compatability (older binary on newer
> > kernel).  In this case, using the workaround is also fine on a fixed kernel,
> > so it doesn't hurt if GDB is compiled on an older system (thus using the
> > workaround) and then run under a newer kernel.
> 
>  Fair enough.
> 
>  Is the one-way compatibility enforced though, by a system library runtime 
> or the kernel somehow, by refusing to run a binary built for a kernel that 
> is newer than one currently in charge of the system?  Otherwise the rule 
> would be quite fragile and error prone, asking for extra care to be taken 
> by the user.

It is enforced in some ways but not others.  Kernel modules do depend on a
version number in such a way that attempting to load a newer kernel module
on an older kernel will fail.  However, the general policy of only supporting
one-way compatibility is well-known among the FreeBSD userbase (for example,
the instructions for upgrading a system from source require booting into a
new kernel before installing the matching userland binaries).
  
Maciej W. Rozycki March 3, 2018, 5:42 p.m. UTC | #7
On Fri, 2 Mar 2018, John Baldwin wrote:

> >  Is the one-way compatibility enforced though, by a system library runtime 
> > or the kernel somehow, by refusing to run a binary built for a kernel that 
> > is newer than one currently in charge of the system?  Otherwise the rule 
> > would be quite fragile and error prone, asking for extra care to be taken 
> > by the user.
> 
> It is enforced in some ways but not others.  Kernel modules do depend on a
> version number in such a way that attempting to load a newer kernel module
> on an older kernel will fail.  However, the general policy of only supporting
> one-way compatibility is well-known among the FreeBSD userbase (for example,
> the instructions for upgrading a system from source require booting into a
> new kernel before installing the matching userland binaries).

 That all sounds right, however does not really address my concern where a 
user *unknowingly* tries a program binary that has been compiled for a 
newer kernel version and then faces all kinds of issues.  This is bound to 
happen sooner or later for someone, the Murphy's law guarantees it.

  Maciej
  

Patch

diff --git a/gdb/ChangeLog b/gdb/ChangeLog
index d661310b07..3db2f544bf 100644
--- a/gdb/ChangeLog
+++ b/gdb/ChangeLog
@@ -1,3 +1,9 @@ 
+2017-09-22  John Baldwin  <jhb@FreeBSD.org>
+
+	* fbsd-nat.c (fbsd_resume): Use PT_SETSTEP for stepping and a
+	wildcard process pid for super_resume for kernels with a
+	specific bug.
+
 2017-09-22  John Baldwin  <jhb@FreeBSD.org>
 
 	* fbsd-nat.c: Include "inf-ptrace.h".
diff --git a/gdb/fbsd-nat.c b/gdb/fbsd-nat.c
index d44950618c..9c87bfed33 100644
--- a/gdb/fbsd-nat.c
+++ b/gdb/fbsd-nat.c
@@ -1163,6 +1163,39 @@  fbsd_resume (struct target_ops *ops,
 	}
       ptid = inferior_ptid;
     }
+
+#if __FreeBSD_version < 1200052
+  /*
+   * When multiple threads within a process wish to report STOPPED
+   * events from wait(), the kernel picks one thread event as the
+   * thread event to report.  The chosen thread event is retrieved via
+   * PT_LWPINFO by passing the process ID as the request pid.  If
+   * multiple events are pending, then the subsequent wait() after
+   * resuming a process will report another STOPPED event after
+   * resuming the process to handle the next thread event and so on.
+   *
+   * A single thread event is cleared as a side effect of resuming the
+   * process with PT_CONTINUE, PT_STEP, etc.  In older kernels,
+   * however, the request pid was used to select which thread's event
+   * was cleared rather than always clearing the event that was just
+   * reported.  To avoid clearing the event of the wrong LWP, always
+   * pass the process ID instead of an LWP ID to PT_CONTINUE or
+   * PT_SYSCALL.
+   *
+   * In the case of stepping, the process ID cannot be used with
+   * PT_STEP since it would step the thread that reported an event
+   * which may not be the thread indicated by PTID.  For stepping, use
+   * PT_SETSTEP to enable stepping on the desired thread before
+   * resuming the process via PT_CONTINUE instead of using PT_STEP.
+   */
+  if (step)
+    {
+      if (ptrace (PT_SETSTEP, get_ptrace_pid (ptid), NULL, 0) == -1)
+	perror_with_name (("ptrace"));
+      step = 0;
+    }
+  ptid = ptid_t (ptid.pid ());
+#endif
   super_resume (ops, ptid, step, signo);
 }