Message-ID: <569E8050.20606@arm.com>
State: New, archived
Date: Tue, 19 Jan 2016 18:28:32 +0000
From: Szabolcs Nagy <szabolcs.nagy@arm.com>
To: Carlos O'Donell <carlos@redhat.com>, GNU C Library <libc-alpha@sourceware.org>
CC: <i.palachev@samsung.com>, "triegel@redhat.com" <triegel@redhat.com>, <nd@arm.com>
Subject: Re: [PATCH][BZ #19329] Fix race between tls allocation at thread creation and dlopen
References: <568D5E11.3010301@arm.com> <5693D908.8090104@arm.com> <5695B7BF.7050000@redhat.com>
In-Reply-To: <5695B7BF.7050000@redhat.com>
Commit Message
Szabolcs Nagy
Jan. 19, 2016, 6:28 p.m. UTC
On 13/01/16 02:34, Carlos O'Donell wrote:
> On 01/11/2016 11:32 AM, Szabolcs Nagy wrote:
>> 2016-01-11  Szabolcs Nagy  <szabolcs.nagy@arm.com>
>>
>> 	[BZ #19329]
>> 	* elf/dl-open.c (dl_open_worker): Write GL(dl_tls_generation)
>> 	atomically.
>> 	* elf/dl-tls.c (_dl_allocate_tls_init): Read GL(dl_tls_generation),
>> 	GL(dl_tls_max_dtv_idx), slotinfo entries and listp->next atomically.
>> 	(_dl_add_to_slotinfo): Write the slotinfo entries and listp->next
>> 	atomically.
>
> You are headed in the right direction. I like where this patch is going,
> but don't have enough time to review this in detail yet.
>
> At first glance your patch lacks sufficient concurrency documentation to
> be acceptable. You need to document which acquires the releases
> synchronize with. Please grep for "CONCURRENCY NOTES" to see the level
> of detail we need to maintain these kinds of changes.

Patch v3: I added some concurrency notes. There are some TODO items left in the comments. I decided not to change the __tls_get_addr and dlclose code paths; compared to v2, this patch only changes comments.

Overall the patch only turns some loads/stores into atomics and changes their order (but not in a way that can be observed without concurrently running pthread_create and dlopen). Two branches are changed: if slotinfo[i].gen > GL(dl_tls_generation), that entry is skipped instead of aborting, and if listp->next == NULL, the iteration finishes instead of aborting. These changes might not be entirely safe: dlclose is still not synchronized with pthread_create, and there might be other bugs that can cause a large slotinfo[i].gen or an early listp->next == NULL; the current glibc code would abort in those cases, while the new code will continue without noticing.
Note that the current asserts do not cover the concurrency bugs completely: a concurrent dlclose can (even with the patch) lead to use-after-free, and concurrently added modules can lead to a read of an uninitialized slotinfo entry where map is already non-NULL but gen is 0 (and thus smaller than the current generation count), which can lead to an out-of-bounds write to the DTV or other memory corruption. (If the asserts more helpfully protected against memory corruption, then I'd be reluctant to remove them before a complete fix for the concurrency issues.)

The acquire load of every slotinfo[i].map might be an excessive use of atomics in _dl_allocate_tls_init, but I think it is necessary with the current design.

I'm OK if this gets rescheduled after 2.23.

2016-01-19  Szabolcs Nagy  <szabolcs.nagy@arm.com>

	[BZ #19329]
	* elf/dl-open.c (dl_open_worker): Write GL(dl_tls_generation)
	atomically.
	* elf/dl-tls.c (_dl_allocate_tls_init): Read GL(dl_tls_generation),
	GL(dl_tls_max_dtv_idx), slotinfo entries and listp->next atomically.
	(_dl_add_to_slotinfo): Write the slotinfo entries and listp->next
	atomically.
Comments
I've just looked at your patch so far and haven't analyzed all the affected code. Nonetheless, a few comments below; so far, I would rather postpone this to 2.23 because you do change a few things and I don't see the level of detail in analysis and code comments that we'd ideally like to see for concurrency-related changes.

On Tue, 2016-01-19 at 18:28 +0000, Szabolcs Nagy wrote:
> diff --git a/elf/dl-open.c b/elf/dl-open.c
> index 6f178b3..2b97605 100644
> --- a/elf/dl-open.c
> +++ b/elf/dl-open.c
> @@ -524,9 +524,16 @@ dl_open_worker (void *a)
>      }
>
>    /* Bump the generation number if necessary.  */
> -  if (any_tls && __builtin_expect (++GL(dl_tls_generation) == 0, 0))
> -    _dl_fatal_printf (N_("\
> +  if (any_tls)
> +    {
> +      size_t newgen = GL(dl_tls_generation) + 1;

If you access a memory location with atomics somewhere, access it with atomics everywhere (initialization is typically an exception). This is important because it's our way to deal with not having atomic types right now (and so the compiler is aware, too), and because it's important for readers of the code because it alerts them that this is concurrent code.

> +      if (__builtin_expect (newgen == 0, 0))
> +	_dl_fatal_printf (N_("\
> TLS generation counter wrapped!  Please report this."));
> +      /* Synchronize with the load acquire in _dl_allocate_tls_init.
> +	 See the CONCURRENCY NOTES there in dl-tls.c.  */
> +      atomic_store_release (&GL(dl_tls_generation), newgen);

Is there any reason you wouldn't want a release MO atomic increment here?

> +    }
>
>    /* We need a second pass for static tls data, because _dl_update_slotinfo
>       must not be run while calls to _dl_add_to_slotinfo are still pending.  */
> diff --git a/elf/dl-tls.c b/elf/dl-tls.c
> index ed13fd9..7184a54 100644
> --- a/elf/dl-tls.c
> +++ b/elf/dl-tls.c
> @@ -443,6 +443,48 @@ _dl_resize_dtv (dtv_t *dtv)
>  }
>
>
> +/* CONCURRENCY NOTES:

This is a good start, but I think we need more detail, and we need to more carefully describe the high-level synchronization scheme (which we are still working on, of course).

> +   During dynamic TLS and DTV allocation and setup various objects may be
> +   accessed concurrently:
> +
> +   GL(dl_tls_max_dtv_idx)
> +   GL(dl_tls_generation)
> +   listp->slotinfo[i].map
> +   listp->slotinfo[i].gen
> +   listp->next
> +
> +   where listp is a node in the GL(dl_tls_dtv_slotinfo_list) list.  The public
> +   APIs that may access them are
> +
> +   Writers: dlopen, dlclose and dynamic linker start up code.
> +   Readers only: pthread_create and __tls_get_addr (TLS access).

I would first describe what these functions / features intend to do and what synchronization problem they have to solve before going into details of how that's implemented. This should be sufficiently detailed to answer all questions such as "why do I need this critical section described below?".

> +   The writers hold the GL(dl_load_lock), but the readers don't, so atomics
> +   should be used when accessing these globals.

This covers the low-level data races, but it doesn't yet explain why you'd want the critical sections. Think about which atomicity and ordering guarantees you need to make your high-level synchronization scheme work.

> +   dl_open_worker (called from dlopen) for each loaded module increases
> +   GL(dl_tls_max_dtv_idx), sets the link_map of the module up, adds a new
> +   slotinfo entry to GL(dl_tls_dtv_slotinfo_list) with the new link_map and
> +   the next generation number GL(dl_tls_generation)+1.  Then it increases
> +   GL(dl_tls_generation) which signals that the new slotinfo entries are ready.

You describe this as if it were sequential code. Is it, or do you need more detail?

> +   This last write is release mo so previous writes can be synchronized.

Terminology: the matching acquire MO load synchronizes with the release MO write. The writes sequenced before the release MO store then happen before stuff sequenced after the acquire MO load.

Also, uppercase MO so it's clear it's an abbreviation.

> +   GL(dl_tls_max_dtv_idx) is always an upper bound of the modids of the ready
> +   entries.  The slotinfo list might be shorter than that during dlopen.
> +   Entries in the slotinfo list might have gen > GL(dl_tls_generation) and
> +   map == NULL.

That bit sounds like a candidate for the high-level scheme.

> +   _dl_allocate_tls_init is called from pthread_create and it looks through
> +   the slotinfo list to do the dynamic TLS and DTV setup for the new thread.
> +   It first loads the current GL(dl_tls_generation) with acquire mo and only
> +   considers modules up to that generation ignoring any later change to the
> +   slotinfo list.
> +
> +   TODO: Entries might get changed and freed in dlclose without sync.
> +   TODO: __tls_get_addr is not yet synchronized with dlopen and dlclose.
> +*/
> +
>  void *
>  internal_function
>  _dl_allocate_tls_init (void *result)
> @@ -455,9 +497,18 @@ _dl_allocate_tls_init (void *result)
>    struct dtv_slotinfo_list *listp;
>    size_t total = 0;
>    size_t maxgen = 0;
> -
> -  /* Check if the current dtv is big enough.  */
> -  if (dtv[-1].counter < GL(dl_tls_max_dtv_idx))
> +  size_t gen_count;
> +  size_t dtv_slots;
> +
> +  /* Synchronize with the release mo store in dl_open_worker, modules with
> +     larger generation number are ignored.  */
> +  gen_count = atomic_load_acquire (&GL(dl_tls_generation));
> +  /* Check if the current dtv is big enough.  GL(dl_tls_max_dtv_idx) is
> +     concurrently modified, but after the release mo store to

"after" in terms of which relation? This is a good example of a place where you could probably refer to a part of the high-level synchronization scheme.

> +     GL(dl_tls_generation) it always remains a modid upper bound for
> +     previously loaded modules so relaxed access is enough.  */
> +  dtv_slots = atomic_load_relaxed (&GL(dl_tls_max_dtv_idx));
> +  if (dtv[-1].counter < dtv_slots)
>      {
>        /* Resize the dtv.  */
>        dtv = _dl_resize_dtv (dtv);
> @@ -480,18 +531,25 @@
>        void *dest;
>
>        /* Check for the total number of used slots.  */
> -      if (total + cnt > GL(dl_tls_max_dtv_idx))
> +      if (total + cnt > dtv_slots)
>          break;
>
> -      map = listp->slotinfo[cnt].map;
> +      /* Synchronize with the release mo store in _dl_add_to_slotinfo in
> +         dlopen, so the generation number read below is for a valid entry.

Isn't it in dl-tls.c?

> +         TODO: remove_slotinfo in dlclose is not synchronized.  */
> +      map = atomic_load_acquire (&listp->slotinfo[cnt].map);
>        if (map == NULL)
>          /* Unused entry.  */
>          continue;

It might be worthwhile to point out why missing out on a concurrent update to map would be okay here, or why it can't actually happen.

> +      size_t gen = listp->slotinfo[cnt].gen;
> +      if (gen > gen_count)
> +        /* New, concurrently loaded entry.  */
> +        continue;
> +
>        /* Keep track of the maximum generation number.  This might
>           not be the generation counter.  */
> -      assert (listp->slotinfo[cnt].gen <= GL(dl_tls_generation));

See my comments elsewhere in the thread.

> -      maxgen = MAX (maxgen, listp->slotinfo[cnt].gen);
> +      maxgen = MAX (maxgen, gen);
>
>        dtv[map->l_tls_modid].pointer.val = TLS_DTV_UNALLOCATED;
>        dtv[map->l_tls_modid].pointer.is_static = false;
> @@ -518,11 +576,14 @@ _dl_allocate_tls_init (void *result)
>      }
>
>      total += cnt;
> -    if (total >= GL(dl_tls_max_dtv_idx))
> +    if (total >= dtv_slots)

Could this have changed in the meantime, and thus be large enough? You change here what is potentially more than one memory access (depending on what the compiler does to this, strictly speaking, sequential code) to a single one in the beginning.

>        break;
>
> -    listp = listp->next;
> -    assert (listp != NULL);
> +    /* Synchronize with the release mo store in _dl_add_to_slotinfo
> +       so only initialized slotinfo nodes are looked at.  */
> +    listp = atomic_load_acquire (&listp->next);
> +    if (listp == NULL)
> +      break;
>    }
>
>    /* The DTV version is up-to-date now.  */
> @@ -916,7 +977,7 @@ _dl_add_to_slotinfo (struct link_map *l)
>       the first slot.  */
>    assert (idx == 0);
>
> -  listp = prevp->next = (struct dtv_slotinfo_list *)
> +  listp = (struct dtv_slotinfo_list *)
>      malloc (sizeof (struct dtv_slotinfo_list)
>              + TLS_SLOTINFO_SURPLUS * sizeof (struct dtv_slotinfo));
>    if (listp == NULL)

There's another store to dl_tls_generation around here that you're not accessing atomically.

> @@ -939,9 +1000,15 @@ cannot create TLS data structures"));
>    listp->next = NULL;
>    memset (listp->slotinfo, '\0',
>            TLS_SLOTINFO_SURPLUS * sizeof (struct dtv_slotinfo));
> +  /* _dl_allocate_tls_init concurrently walks this list at thread
> +     creation, it must only observe initialized nodes in the list.

Another bit for the high-level scheme.

> +     See the CONCURRENCY NOTES there.  */
> +  atomic_store_release (&prevp->next, listp);
>  }
>
>  /* Add the information into the slotinfo data structure.  */
> -  listp->slotinfo[idx].map = l;
>    listp->slotinfo[idx].gen = GL(dl_tls_generation) + 1;
> +  /* Synchronize with the acquire load in _dl_allocate_tls_init.

Say why you need to enforce this, or point to part in the higher-level scheme where this is explained.

> +     See the CONCURRENCY NOTES there.  */
> +  atomic_store_release (&listp->slotinfo[idx].map, l);
>  }
On 22/01/16 01:02, Torvald Riegel wrote:
> I've just looked at your patch so far and haven't analyzed all the
> affected code.  Nonetheless, a few comments below; so far, I would

thanks

> rather postpone this to 2.23 because you do change a few things and I

ok

>> -  if (any_tls && __builtin_expect (++GL(dl_tls_generation) == 0, 0))
>> -    _dl_fatal_printf (N_("\
>> +  if (any_tls)
>> +    {
>> +      size_t newgen = GL(dl_tls_generation) + 1;
>
> If you access a memory location with atomics somewhere, access it with
> atomics everywhere (initialization is typically an exception).  This is

ok

> important because it's our way to deal with not having atomic types
> right now (and so the compiler is aware, too), and because it's
> important for readers of the code because it alerts them that this is
> concurrent code.
>
>> +      if (__builtin_expect (newgen == 0, 0))
>> +	_dl_fatal_printf (N_("\
>> TLS generation counter wrapped!  Please report this."));
>> +      /* Synchronize with the load acquire in _dl_allocate_tls_init.
>> +	 See the CONCURRENCY NOTES there in dl-tls.c.  */
>> +      atomic_store_release (&GL(dl_tls_generation), newgen);
>
> Is there any reason you wouldn't want a release MO atomic increment
> here?

I wanted to avoid storing an overflowing counter. (I didn't want to analyze that case.)

>> +   During dynamic TLS and DTV allocation and setup various objects may be
>> +   accessed concurrently:
>> +
>> +   GL(dl_tls_max_dtv_idx)
>> +   GL(dl_tls_generation)
>> +   listp->slotinfo[i].map
>> +   listp->slotinfo[i].gen
>> +   listp->next
>> +
>> +   where listp is a node in the GL(dl_tls_dtv_slotinfo_list) list.  The public
>> +   APIs that may access them are
>> +
>> +   Writers: dlopen, dlclose and dynamic linker start up code.
>> +   Readers only: pthread_create and __tls_get_addr (TLS access).
>
> I would first describe what these functions / features intend to do and
> what synchronization problem they have to solve before going into
> details of how that's implemented.  This should be sufficiently detailed
> to answer all questions such as "why do I need this critical section
> described below?".

I probably don't know all the details of that.

>> +   The writers hold the GL(dl_load_lock), but the readers don't, so atomics
>> +   should be used when accessing these globals.
>
> This covers the low-level data races, but it doesn't yet explain why
> you'd want the critical sections.  Think about which atomicity and
> ordering guarantees you need to make your high-level synchronization
> scheme work.

I can write more about the synchronizations I added, but not much about what other state dl_load_lock protects.

>> +   dl_open_worker (called from dlopen) for each loaded module increases
>> +   GL(dl_tls_max_dtv_idx), sets the link_map of the module up, adds a new
>> +   slotinfo entry to GL(dl_tls_dtv_slotinfo_list) with the new link_map and
>> +   the next generation number GL(dl_tls_generation)+1.  Then it increases
>> +   GL(dl_tls_generation) which signals that the new slotinfo entries are ready.
>
> You describe this as if it were sequential code.  Is it, or do you need
> more detail?

Yes, this is sequential code in dlopen (but scattered around in many functions, so it is not immediately obvious).

>> +   This last write is release mo so previous writes can be synchronized.
>
> Terminology: the matching acquire MO load synchronizes with the release
> MO write.  The writes sequenced before the release MO store then happen
> before stuff sequenced after the acquire MO load.
>
> Also, uppercase MO so it's clear it's an abbreviation.

ok

>> -
>> -  /* Check if the current dtv is big enough.  */
>> -  if (dtv[-1].counter < GL(dl_tls_max_dtv_idx))
>> +  size_t gen_count;
>> +  size_t dtv_slots;
>> +
>> +  /* Synchronize with the release mo store in dl_open_worker, modules with
>> +     larger generation number are ignored.  */
>> +  gen_count = atomic_load_acquire (&GL(dl_tls_generation));
>> +  /* Check if the current dtv is big enough.  GL(dl_tls_max_dtv_idx) is
>> +     concurrently modified, but after the release mo store to
>
> "after" in terms of which relation?  This is a good example of a place
> where you could probably refer to a part of the high-level
> synchronization scheme.

If the load of *_idx "happens after" the store of *_generation, then it is an upper bound to modids at that generation.

>> +     GL(dl_tls_generation) it always remains a modid upper bound for
>> +     previously loaded modules so relaxed access is enough.  */
>> +  dtv_slots = atomic_load_relaxed (&GL(dl_tls_max_dtv_idx));
...
>> -  map = listp->slotinfo[cnt].map;
>> +  /* Synchronize with the release mo store in _dl_add_to_slotinfo in
>> +     dlopen, so the generation number read below is for a valid entry.
>
> Isn't it in dl-tls.c?

Yes, but it is called from dlopen.

>> +     TODO: remove_slotinfo in dlclose is not synchronized.  */
>> +  map = atomic_load_acquire (&listp->slotinfo[cnt].map);
>>    if (map == NULL)
>>      /* Unused entry.  */
>>      continue;
>
> It might be worthwhile to point out why missing out on a concurrent
> update to map would be okay here, or why it can't actually happen.

ok

Concurrent updates of the relevant maps don't happen (except in dlclose, but that bug is not fixed). Only slotinfo entries that should be ignored here may change concurrently (and the dtv/tls of those are initialized lazily at first access).

>> +  size_t gen = listp->slotinfo[cnt].gen;
>> +  if (gen > gen_count)
>> +    /* New, concurrently loaded entry.  */
>> +    continue;
>> +
>>    /* Keep track of the maximum generation number.  This might
>>       not be the generation counter.  */
>> -  assert (listp->slotinfo[cnt].gen <= GL(dl_tls_generation));
>
> See my comments elsewhere in the thread.

I can keep the assert, but the gen > gen_count cases must be skipped here, so the only way the assert can fail is if the generation number overflows (otherwise gen <= gen_count <= GL(dl_tls_generation) holds).

>> -  maxgen = MAX (maxgen, listp->slotinfo[cnt].gen);
>> +  maxgen = MAX (maxgen, gen);
>>
>>    dtv[map->l_tls_modid].pointer.val = TLS_DTV_UNALLOCATED;
>>    dtv[map->l_tls_modid].pointer.is_static = false;
>> @@ -518,11 +576,14 @@ _dl_allocate_tls_init (void *result)
>>    }
>>
>>    total += cnt;
>> -  if (total >= GL(dl_tls_max_dtv_idx))
>> +  if (total >= dtv_slots)
>
> Could this have changed in the meantime, and thus be large enough?  You
> change here what is potentially more than one memory access (depending
> on what the compiler does to this, strictly speaking, sequential code)
> to a single one in the beginning.

dtv_slots at the beginning is already an upper bound to the relevant modids (and that's how much dtv we allocated), so we can finish the iteration if the total of visited modules grows beyond that.

Rereading it does not cause a problem, but it does not help either (it can be smaller or larger as well, depending on what dlopen/dlclose happened concurrently; the slotinfo list does not shrink, though).

>> @@ -916,7 +977,7 @@ _dl_add_to_slotinfo (struct link_map *l)
>>       the first slot.  */
>>    assert (idx == 0);
>>
>> -  listp = prevp->next = (struct dtv_slotinfo_list *)
>> +  listp = (struct dtv_slotinfo_list *)
>>      malloc (sizeof (struct dtv_slotinfo_list)
>>              + TLS_SLOTINFO_SURPLUS * sizeof (struct dtv_slotinfo));
>>    if (listp == NULL)
>
> There's another store to dl_tls_generation around here that you're not
> accessing atomically.

Yes, the OOM code path there looked broken: it does not check for overflow (bug), it leaks memory (another bug), and the comment roughly says "the application will crash anyway", so I wanted to avoid messing with it. But yes, that should be atomic too.

>> @@ -939,9 +1000,15 @@ cannot create TLS data structures"));
>>    listp->next = NULL;
>>    memset (listp->slotinfo, '\0',
>>            TLS_SLOTINFO_SURPLUS * sizeof (struct dtv_slotinfo));
>> +  /* _dl_allocate_tls_init concurrently walks this list at thread
>> +     creation, it must only observe initialized nodes in the list.
>
> Another bit for the high-level scheme.
>
>> +     See the CONCURRENCY NOTES there.  */
>> +  atomic_store_release (&prevp->next, listp);
>>  }
>>
>>  /* Add the information into the slotinfo data structure.  */
>> -  listp->slotinfo[idx].map = l;
>>    listp->slotinfo[idx].gen = GL(dl_tls_generation) + 1;
>> +  /* Synchronize with the acquire load in _dl_allocate_tls_init.
>
> Say why you need to enforce this, or point to part in the higher-level
> scheme where this is explained.

_dl_allocate_tls_init walks the slotinfo list beyond the entries it actually cares about; the extra entries are either unused (zero) or initialized here with a large gen number, so those extra entries must be possible to recognize. I used synchronization so that either map == 0 or gen > gen_count is visible to _dl_allocate_tls_init for new entries.

However, now I see yet another bug here: the gen can overflow and become 0 that way too, so the right fix is to check for (map == 0 || gen == 0 || gen > gen_count), and then no synchronization is needed, only relaxed atomic access. But I have to verify that gen == 0 cannot happen otherwise, so that it's OK to skip it in _dl_allocate_tls_init.

>> +     See the CONCURRENCY NOTES there.  */
>> +  atomic_store_release (&listp->slotinfo[idx].map, l);
>>  }
On Fri, 2016-01-22 at 19:31 +0000, Szabolcs Nagy wrote:
> On 22/01/16 01:02, Torvald Riegel wrote:
> > important because it's our way to deal with not having atomic types
> > right now (and so the compiler is aware, too), and because it's
> > important for readers of the code because it alerts them that this is
> > concurrent code.
> >
> >> +      if (__builtin_expect (newgen == 0, 0))
> >> +	_dl_fatal_printf (N_("\
> >> TLS generation counter wrapped!  Please report this."));
> >> +      /* Synchronize with the load acquire in _dl_allocate_tls_init.
> >> +	 See the CONCURRENCY NOTES there in dl-tls.c.  */
> >> +      atomic_store_release (&GL(dl_tls_generation), newgen);
> >
> > Is there any reason you wouldn't want a release MO atomic increment
> > here?
>
> i wanted to avoid storing an overflowing counter.
> (didn't want to analyze that case)

If there are concurrent increments, not using an atomic increment will lose updates. If that can happen, the code should point out why that's okay. Possible countermeasures are using a CAS loop (less scalable if there is contention (is there?), but simple) or setting the overflow threshold low enough and implementing an anti-overflow mechanism so that the maximum number of threads that can run concurrently can never actually cause an overflow (not simple, but also not too hard; see the new barrier, for example).

> >> +   During dynamic TLS and DTV allocation and setup various objects may be
> >> +   accessed concurrently:
> >> +
> >> +   GL(dl_tls_max_dtv_idx)
> >> +   GL(dl_tls_generation)
> >> +   listp->slotinfo[i].map
> >> +   listp->slotinfo[i].gen
> >> +   listp->next
> >> +
> >> +   where listp is a node in the GL(dl_tls_dtv_slotinfo_list) list.  The public
> >> +   APIs that may access them are
> >> +
> >> +   Writers: dlopen, dlclose and dynamic linker start up code.
> >> +   Readers only: pthread_create and __tls_get_addr (TLS access).
> >
> > I would first describe what these functions / features intend to do and
> > what synchronization problem they have to solve before going into
> > details of how that's implemented.  This should be sufficiently detailed
> > to answer all questions such as "why do I need this critical section
> > described below?".
>
> i probably don't know all the details of that.

That's okay given that this is a complex matter and we're still investigating all its parts. Nonetheless, if you can't describe the solution at a high level because you don't know all the details, it's (1) a good indication that this needs more work (IOW, "if you can't explain it, it may not be ready") and (2) worth pointing out in a comment in the code so that everyone else is aware that this piece of code is still WIP.

> >> +   The writers hold the GL(dl_load_lock), but the readers don't, so atomics
> >> +   should be used when accessing these globals.
> >
> > This covers the low-level data races, but it doesn't yet explain why
> > you'd want the critical sections.  Think about which atomicity and
> > ordering guarantees you need to make your high-level synchronization
> > scheme work.
>
> i can write more about the synchronizations i added,
> but not much about what other state dl_load_lock
> protects.

Either should be an improvement. What I wanted to ask for is not just more information, but also how it's presented and structured. For example, knowing which state a lock "protects" is just a bit of the information necessary; it often makes more sense to think about the high-level atomicity and ordering constraints one needs first, and then see how to solve that with mutual exclusion and locks, and then it falls out which data (or rather, logical state and state transitions!) are "protected" by a lock.
> >> + dl_open_worker (called from dlopen) for each loaded module increases > >> + GL(dl_tls_max_dtv_idx), sets the link_map of the module up, adds a new > >> + slotinfo entry to GL(dl_tls_dtv_slotinfo_list) with the new link_map and > >> + the next generation number GL(dl_tls_generation)+1. Then it increases > >> + GL(dl_tls_generation) which sinals that the new slotinfo entries are ready. > > > > You describe this as if it were sequential code. Is it, or do you need > > more detail? > > yes this is sequential code in dlopen. > (but scattered around in many functions so it > is not immediately obvious) > > >> + This last write is release mo so previous writes can be synchronized. > > > > Terminology: the matching acquire MO load synchronized with the release > > MO write. The writes sequenced before the release MO store than happen > > before stuff sequenced after the acquire MO load. > > > > Also, uppercase MO so it's clear it's an abbreviation. > > ok > > >> - > >> - /* Check if the current dtv is big enough. */ > >> - if (dtv[-1].counter < GL(dl_tls_max_dtv_idx)) > >> + size_t gen_count; > >> + size_t dtv_slots; > >> + > >> + /* Synchronize with the release mo store in dl_open_worker, modules with > >> + larger generation number are ignored. */ > >> + gen_count = atomic_load_acquire (&GL(dl_tls_generation)); > >> + /* Check if the current dtv is big enough. GL(dl_tls_max_dtv_idx) is > >> + concurrently modified, but after the release mo store to > > > > "after" in terms of which relation? This is a good example of a place > > where you could probably refer to a part of the high-level > > synchronization scheme. > > if the load of *_idx "happens after" the store of *_generation > then it is an upper bound to modids at that generation. The "happens after" is either enforced by something, and then you should refer to that piece. 
Or if the load reads from (as in the reads-from relation in the memory
model) the store, then it either may itself enforce a happens-before,
or use something else.  IOW, I still don't understand what you really
mean :)

> >> +     GL(dl_tls_generation) it always remains a modid upper bound for
> >> +     previously loaded modules so relaxed access is enough.  */
> >> +  dtv_slots = atomic_load_relaxed (&GL(dl_tls_max_dtv_idx));
> ...
> >> -	  map = listp->slotinfo[cnt].map;
> >> +	  /* Synchronize with the release mo store in _dl_add_to_slotinfo in
> >> +	     dlopen, so the generation number read below is for a valid entry.
> >
> > Isn't it in dl-tls.c?
>
> yes, but it is called from dlopen.

So please say that it is called from dlopen, not "in dlopen".

> >> +	     TODO: remove_slotinfo in dlclose is not synchronized.  */
> >> +	  map = atomic_load_acquire (&listp->slotinfo[cnt].map);
> >>  	  if (map == NULL)
> >>  	    /* Unused entry.  */
> >>  	    continue;
> >
> > It might be worthwhile to point out why missing out on a concurrent
> > update to map would be okay here, or why it can't actually happen.
>
> ok
>
> concurrent update of the relevant maps doesn't happen
> (except in dlclose.. but that bug is not fixed)
>
> only slotinfo entries that should be ignored here may
> change concurrently (and the dtv/tls of those are
> initialized lazily at first access).
>
> >> +	  size_t gen = listp->slotinfo[cnt].gen;
> >> +	  if (gen > gen_count)
> >> +	    /* New, concurrently loaded entry.  */
> >> +	    continue;
> >> +
> >>  	  /* Keep track of the maximum generation number.  This might
> >>  	     not be the generation counter.  */
> >> -	  assert (listp->slotinfo[cnt].gen <= GL(dl_tls_generation));
> >
> > See my comments elsewhere in the thread.
>
> i can keep the assert but the gen > gen_count cases
> must be skipped here so the only way the assert can
> fail is if the generation number overflows.
> (otherwise gen <= gen_count <= GL(dl_tls_generation) holds)
>
> >> -	  maxgen = MAX (maxgen, listp->slotinfo[cnt].gen);
> >> +	  maxgen = MAX (maxgen, gen);
> >>
> >>  	  dtv[map->l_tls_modid].pointer.val = TLS_DTV_UNALLOCATED;
> >>  	  dtv[map->l_tls_modid].pointer.is_static = false;
> >> @@ -518,11 +576,14 @@ _dl_allocate_tls_init (void *result)
> >>  	}
> >>
> >>        total += cnt;
> >> -      if (total >= GL(dl_tls_max_dtv_idx))
> >> +      if (total >= dtv_slots)
> >
> > Could this have changed in the meantime, and thus be large enough?
> > You change here what is potentially more than one memory access
> > (depending on what the compiler does to this, strictly speaking,
> > sequential code) to a single one in the beginning.
>
> dtv_slots at the beginning is already an upper bound to
> the relevant modids (and that's how much dtv we allocated),
> so we can finish the iteration if the total of visited modules
> grows beyond that.
>
> rereading it does not cause a problem, but it does not
> help either (it can be smaller or larger as well, depending
> on what dlopen/dlclose happened concurrently; the
> slotinfo list does not shrink though)

While we can discuss the analysis here, eventually the outcome needs
to be included as comments in the code, as part of the patch.  If the
intent / high-level scheme isn't obvious to see (which is the case
given that we have to discuss it here), then it should be documented,
IMO.

> >> @@ -916,7 +977,7 @@ _dl_add_to_slotinfo (struct link_map *l)
> >>  	 the first slot.  */
> >>        assert (idx == 0);
> >>
> >> -      listp = prevp->next = (struct dtv_slotinfo_list *)
> >> +      listp = (struct dtv_slotinfo_list *)
> >>  	malloc (sizeof (struct dtv_slotinfo_list)
> >>  		+ TLS_SLOTINFO_SURPLUS * sizeof (struct dtv_slotinfo));
> >>        if (listp == NULL)
> >
> > There's another store to dl_tls_generation around here that you're
> > not accessing atomically.
> >
>
> yes, the oom code path
>
> that logic looked broken: it does not check for overflow (bug),
> it leaks memory (another bug) and the comment roughly says
> "the application will crash anyway" so i wanted to avoid messing
> with it.
>
> but yes that should be atomic too.
>
> >> @@ -939,9 +1000,15 @@ cannot create TLS data structures"));
> >>        listp->next = NULL;
> >>        memset (listp->slotinfo, '\0',
> >>  	      TLS_SLOTINFO_SURPLUS * sizeof (struct dtv_slotinfo));
> >> +      /* _dl_allocate_tls_init concurrently walks this list at thread
> >> +	 creation, it must only observe initialized nodes in the list.
> >
> > Another bit for the high-level scheme.
> >
> >> +	 See the CONCURRENCY NOTES there.  */
> >> +      atomic_store_release (&prevp->next, listp);
> >>      }
> >>
> >>    /* Add the information into the slotinfo data structure.  */
> >> -  listp->slotinfo[idx].map = l;
> >>    listp->slotinfo[idx].gen = GL(dl_tls_generation) + 1;
> >> +  /* Synchronize with the acquire load in _dl_allocate_tls_init.
> >
> > Say why you need to enforce this, or point to the part in the
> > higher-level scheme where this is explained.
>
> _dl_allocate_tls_init walks the slotinfo list beyond
> the entries it actually cares about; the extra entries
> are either unused (zero) or initialized here with
> a large gen number.
>
> so those extra entries must be possible to recognize.
>
> i used sync so either map==0 or gen>gen_count is
> visible to _dl_allocate_tls_init for new entries.
>
> however now i see yet another bug here: the gen can
> overflow and become 0 that way too, so the right fix
> is to check for (map==0 || gen==0 || gen>gen_count)
> and no sync is needed, only relaxed atomic access.
>
> but i have to verify gen==0 cannot happen otherwise
> so it's ok to skip it in _dl_allocate_tls_init.

You'll probably have noticed that I'm asking several small corner-case
questions.  In my experience, doing that is helpful to figure out
whether the synchronization really works correctly all the time.
And a good way to do this by yourself, before you send a patch, is to
try to write a thorough explanation of the concurrent code you're
working on.  I've always found this helpful.  The additional benefit is
that you'll have the comments ready so others can understand your code
more easily.

If you have a thorough explanation, you also already have answers for
all the little questions, which may or may not(!) have obvious answers.
Therefore, I'd strongly suggest working on explanations too, even if
that doesn't yield a larger LOC count.  But the complexity of
concurrent code cannot be measured in LOC anyway.
diff --git a/elf/dl-open.c b/elf/dl-open.c
index 6f178b3..2b97605 100644
--- a/elf/dl-open.c
+++ b/elf/dl-open.c
@@ -524,9 +524,16 @@ dl_open_worker (void *a)
     }
 
   /* Bump the generation number if necessary.  */
-  if (any_tls && __builtin_expect (++GL(dl_tls_generation) == 0, 0))
-    _dl_fatal_printf (N_("\
+  if (any_tls)
+    {
+      size_t newgen = GL(dl_tls_generation) + 1;
+      if (__builtin_expect (newgen == 0, 0))
+	_dl_fatal_printf (N_("\
 TLS generation counter wrapped! Please report this."));
+      /* Synchronize with the load acquire in _dl_allocate_tls_init.
+	 See the CONCURRENCY NOTES there in dl-tls.c.  */
+      atomic_store_release (&GL(dl_tls_generation), newgen);
+    }
 
   /* We need a second pass for static tls data, because _dl_update_slotinfo
      must not be run while calls to _dl_add_to_slotinfo are still pending.  */
diff --git a/elf/dl-tls.c b/elf/dl-tls.c
index ed13fd9..7184a54 100644
--- a/elf/dl-tls.c
+++ b/elf/dl-tls.c
@@ -443,6 +443,48 @@ _dl_resize_dtv (dtv_t *dtv)
 }
 
 
+/* CONCURRENCY NOTES:
+
+   During dynamic TLS and DTV allocation and setup various objects may be
+   accessed concurrently:
+
+     GL(dl_tls_max_dtv_idx)
+     GL(dl_tls_generation)
+     listp->slotinfo[i].map
+     listp->slotinfo[i].gen
+     listp->next
+
+   where listp is a node in the GL(dl_tls_dtv_slotinfo_list) list.  The
+   public APIs that may access them are
+
+     Writers: dlopen, dlclose and dynamic linker start up code.
+     Readers only: pthread_create and __tls_get_addr (TLS access).
+
+   The writers hold the GL(dl_load_lock), but the readers don't, so atomics
+   should be used when accessing these globals.
+
+   dl_open_worker (called from dlopen) for each loaded module increases
+   GL(dl_tls_max_dtv_idx), sets the link_map of the module up, adds a new
+   slotinfo entry to GL(dl_tls_dtv_slotinfo_list) with the new link_map and
+   the next generation number GL(dl_tls_generation)+1.  Then it increases
+   GL(dl_tls_generation) which signals that the new slotinfo entries are
+   ready.
+
+   This last write is release mo so previous writes can be synchronized.
+
+   GL(dl_tls_max_dtv_idx) is always an upper bound of the modids of the
+   ready entries.  The slotinfo list might be shorter than that during
+   dlopen.  Entries in the slotinfo list might have
+   gen > GL(dl_tls_generation) and map == NULL.
+
+   _dl_allocate_tls_init is called from pthread_create and it looks through
+   the slotinfo list to do the dynamic TLS and DTV setup for the new thread.
+   It first loads the current GL(dl_tls_generation) with acquire mo and only
+   considers modules up to that generation, ignoring any later change to the
+   slotinfo list.
+
+   TODO: Entries might get changed and freed in dlclose without sync.
+   TODO: __tls_get_addr is not yet synchronized with dlopen and dlclose.  */
+
 void *
 internal_function
 _dl_allocate_tls_init (void *result)
@@ -455,9 +497,18 @@ _dl_allocate_tls_init (void *result)
   struct dtv_slotinfo_list *listp;
   size_t total = 0;
   size_t maxgen = 0;
-
-  /* Check if the current dtv is big enough.  */
-  if (dtv[-1].counter < GL(dl_tls_max_dtv_idx))
+  size_t gen_count;
+  size_t dtv_slots;
+
+  /* Synchronize with the release mo store in dl_open_worker, modules with
+     larger generation number are ignored.  */
+  gen_count = atomic_load_acquire (&GL(dl_tls_generation));
+  /* Check if the current dtv is big enough.  GL(dl_tls_max_dtv_idx) is
+     concurrently modified, but after the release mo store to
+     GL(dl_tls_generation) it always remains a modid upper bound for
+     previously loaded modules so relaxed access is enough.  */
+  dtv_slots = atomic_load_relaxed (&GL(dl_tls_max_dtv_idx));
+  if (dtv[-1].counter < dtv_slots)
     {
       /* Resize the dtv.  */
       dtv = _dl_resize_dtv (dtv);
@@ -480,18 +531,25 @@ _dl_allocate_tls_init (void *result)
 	  void *dest;
 
 	  /* Check for the total number of used slots.  */
-	  if (total + cnt > GL(dl_tls_max_dtv_idx))
+	  if (total + cnt > dtv_slots)
 	    break;
 
-	  map = listp->slotinfo[cnt].map;
+	  /* Synchronize with the release mo store in _dl_add_to_slotinfo in
+	     dlopen, so the generation number read below is for a valid entry.
+	     TODO: remove_slotinfo in dlclose is not synchronized.  */
+	  map = atomic_load_acquire (&listp->slotinfo[cnt].map);
 	  if (map == NULL)
 	    /* Unused entry.  */
 	    continue;
 
+	  size_t gen = listp->slotinfo[cnt].gen;
+	  if (gen > gen_count)
+	    /* New, concurrently loaded entry.  */
+	    continue;
+
 	  /* Keep track of the maximum generation number.  This might
 	     not be the generation counter.  */
-	  assert (listp->slotinfo[cnt].gen <= GL(dl_tls_generation));
-	  maxgen = MAX (maxgen, listp->slotinfo[cnt].gen);
+	  maxgen = MAX (maxgen, gen);
 
 	  dtv[map->l_tls_modid].pointer.val = TLS_DTV_UNALLOCATED;
 	  dtv[map->l_tls_modid].pointer.is_static = false;
@@ -518,11 +576,14 @@ _dl_allocate_tls_init (void *result)
 	}
 
       total += cnt;
-      if (total >= GL(dl_tls_max_dtv_idx))
+      if (total >= dtv_slots)
 	break;
 
-      listp = listp->next;
-      assert (listp != NULL);
+      /* Synchronize with the release mo store in _dl_add_to_slotinfo
+	 so only initialized slotinfo nodes are looked at.  */
+      listp = atomic_load_acquire (&listp->next);
+      if (listp == NULL)
+	break;
     }
 
   /* The DTV version is up-to-date now.  */
@@ -916,7 +977,7 @@ _dl_add_to_slotinfo (struct link_map *l)
 	 the first slot.  */
       assert (idx == 0);
 
-      listp = prevp->next = (struct dtv_slotinfo_list *)
+      listp = (struct dtv_slotinfo_list *)
 	malloc (sizeof (struct dtv_slotinfo_list)
 		+ TLS_SLOTINFO_SURPLUS * sizeof (struct dtv_slotinfo));
      if (listp == NULL)
@@ -939,9 +1000,15 @@ cannot create TLS data structures"));
       listp->next = NULL;
       memset (listp->slotinfo, '\0',
 	      TLS_SLOTINFO_SURPLUS * sizeof (struct dtv_slotinfo));
+      /* _dl_allocate_tls_init concurrently walks this list at thread
+	 creation, it must only observe initialized nodes in the list.
+	 See the CONCURRENCY NOTES there.  */
+      atomic_store_release (&prevp->next, listp);
     }
 
   /* Add the information into the slotinfo data structure.  */
-  listp->slotinfo[idx].map = l;
   listp->slotinfo[idx].gen = GL(dl_tls_generation) + 1;
+  /* Synchronize with the acquire load in _dl_allocate_tls_init.
+     See the CONCURRENCY NOTES there.  */
+  atomic_store_release (&listp->slotinfo[idx].map, l);
 }