pthread stack cache and valgrind helgrind false positive race error on tls variables

  On Sat, 2014-07-19 at 17:02 -0700, Roland McGrath wrote:
> Separate from the implementation details of the workaround you propose, I
> think we would like to understand more clearly what is going on to confuse
> helgrind.
I have attached the (under review) valgrind patch.
Whatever better solution we might have in a future glibc (see below
discussion), we should have a solution for old/current glibc.
Note that this patch applies to the valgrind SVN (it also contains the
test program that shows the tls false positive).

> Though it's all unclear to me so far, I tend to suspect both
> that this workaround is not what we'd consider the ideal fix and that the
> underlying problem is likely to have other symptoms (for other parts of the
> code, other usage patterns, and/or other code outside of libc that rolls
> its own synchronization primitives).
Yes, helgrind has several tricks (kludges?) to avoid raising false
positive in various "low level" code:
  * ld.so is not instrumented, otherwise this causes huge nr of false
    positive to (expensively) suppress.
  * the default valgrind error suppression file contains very general
    suppression entries, suppressing all races that have their top frame
    in libc or libpthread.
    Search for 'not very clever' or 'ugly' in valgrind default.supp.
    A.o., default.supp mentions that helgrind does not understand
    stdio locking.
    The 'not clever/ugly' is mostly because these supp can cause
    false negative.
  * several more precise suppression entries for various false +
    in glibc and others.

The above techniques do not work with the false positive caused
by the stack cache/tls internal structures, as (some of) the races
appears in user code. Also, pthread cond var cannot be properly
handled by helgrind.

For sure, it would be nice to have a general solution for all that
(or at least a solution that decreases the nr of needed kludges
in helgrind).

An approach could be to add in glibc specific helgrind requests to
describe the low level synchronisation primitives 
(e.g. describe the 'create lock/lock/unlock'operations).
This has some drawbacks:
* it introduces a few instructions in the 'hot path' of the low level
  glibc primitives.
  (this might imply to have library versions compiled with/without
   these requests, which makes it more cumbersome for the end user,
   which has then to point at the good library when using helgrind).
* Such annotations are valgrind specific. They will probably work also
  for the other valgrind race detector (drd). But there are other 
  race detection tools which might have similar difficulties.
* It is not sure that all helgrind kludges can be eliminated by
  using these requests.

Someone on valgrind dev mailing list suggested to have specific debug
information used to mark synchronisation primitives.
Interesting idea, but not sure if/how that can work
(and that is for sure a very long term solution).

In summary, have a better support for race detection tools in glibc
is attractive, but looks not trivial.

So, several various things can be done:
* short term: the valgrind hack patch to disable stack cache in
  to support old and current glibc
* when glibc provides a tuning infrastructure, replace the hack
  by a cleaner way to disable the cache
* more longer term: see how glibc could "describe" its behaviour
  to race detection tools
* Maybe at somewhat shorter term, solve the problem of pthread cond var,
  for which there is currently no known hack/kludge
  (see helgrind user manual section 7.5 for a description of the
   difficulty)

Thanks

Philippe

pthread stack cache and valgrind helgrind false positive race error on tls variables

Commit Message

Comments

Patch