Message ID: 594B92ED.6060809@arm.com
State: New, archived

Headers:
Date: Thu, 22 Jun 2017 10:50:37 +0100
From: Szabolcs Nagy <szabolcs.nagy@arm.com>
To: Joseph Myers <joseph@codesourcery.com>
Cc: nd@arm.com, GNU C Library <libc-alpha@sourceware.org>, "triegel@redhat.com" <triegel@redhat.com>
Subject: Re: [PATCH v2] Single threaded stdio optimization
Message-ID: <594B92ED.6060809@arm.com>
References: <594AA0A4.7010600@arm.com> <alpine.DEB.2.20.1706211651110.3924@digraph.polyomino.org.uk>
In-Reply-To: <alpine.DEB.2.20.1706211651110.3924@digraph.polyomino.org.uk>
List-Id: <libc-alpha.sourceware.org>
Commit Message
Szabolcs Nagy
June 22, 2017, 9:50 a.m. UTC
On 21/06/17 17:52, Joseph Myers wrote:
> On Wed, 21 Jun 2017, Szabolcs Nagy wrote:
>
>> created. A new libc abi symbol, _IO_enable_locks, does this.
>> (The abilist files will have to be updated in a separate patch).
>
> I don't see any declarations or uses of this symbol in public headers, so
> why does it need to be at a public symbol version instead of
> GLIBC_PRIVATE?

PATCH v3: use GLIBC_PRIVATE for _IO_enable_locks.
(should i use the __libc_* naming convention for private symbols?)

Locking overhead can be significant in some stdio operations that are common in single threaded applications.

I'd like to address it in this release if possible as it causes performance problems for aarch64 users. I prefer this high-level approach to be reviewed; if that does not work, the aarch64-specific low-level approach will be taken.

This patch adds the _IO_FLAGS2_NEED_LOCK flag to indicate whether an _IO_FILE object needs to be locked; some of the stdio functions simply jump to their _unlocked variant when it is not set. The flag is set on all _IO_FILE objects when the first thread is created. A new GLIBC_PRIVATE libc symbol, _IO_enable_locks, was added to do this from libpthread.

The optimization can be applied to more stdio functions; currently it is only applied to single flag check or single non-wide-char standard operations. The flag should probably never be set for files with _IO_USER_LOCK, but that is just a further optimization, not a correctness requirement.

The optimization is valid in a single thread because stdio operations are non-as-safe (so lock state is not observable from a signal handler) and stdio locks are recursive (so lock state is not observable via deadlock). The optimization is not valid if a thread may be created while an stdio lock is taken, and thus it should be disabled if any user code may run during an stdio operation (interposed malloc, printf hooks, etc). This makes the optimization more complicated for some stdio operations (e.g. printf), but those are bigger and thus less important to optimize, so this patch does not try to do that.

2017-06-22  Szabolcs Nagy  <szabolcs.nagy@arm.com>

	* libio/libio.h (_IO_FLAGS2_NEED_LOCK, _IO_need_lock): Define.
	* libio/libioP.h (_IO_enable_locks): Declare.
	* libio/Versions (_IO_enable_locks): New symbol.
	* libio/genops.c (_IO_enable_locks): Define.
	(_IO_old_init): Initialize flags2.
	* libio/feof.c (_IO_feof): Avoid locking when not needed.
	* libio/ferror.c (_IO_ferror): Likewise.
	* libio/fputc.c (fputc): Likewise.
	* libio/putc.c (_IO_putc): Likewise.
	* libio/getc.c (_IO_getc): Likewise.
	* libio/getchar.c (getchar): Likewise.
	* libio/ioungetc.c (_IO_ungetc): Likewise.
	* nptl/pthread_create.c (__pthread_create_2_1): Enable stdio locks.
	* libio/iofopncook.c (_IO_fopencookie): Enable locking for the file.
	* sysdeps/pthread/flockfile.c (__flockfile): Likewise.
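The per-call pattern the patch describes (test one flag, then jump to the _unlocked variant) can be sketched in isolation. In this minimal sketch the struct, flag value, and helper functions are mock stand-ins for the real _IO_FILE internals, not actual glibc code:

```c
/* Mirrors the value of _IO_FLAGS2_NEED_LOCK in the patch.  */
#define MOCK_FLAGS2_NEED_LOCK 128

struct mock_file
{
  int flags2;
  int lock_count;   /* counts lock acquisitions, for illustration only */
};

static int
mock_getc_unlocked (struct mock_file *fp)
{
  (void) fp;
  return 'x';
}

/* Shape of the patched _IO_getc: a single flag test, then a tail call
   to the unlocked variant when no other thread can hold the lock.  */
static int
mock_getc (struct mock_file *fp)
{
  if (!(fp->flags2 & MOCK_FLAGS2_NEED_LOCK))
    return mock_getc_unlocked (fp);   /* single-threaded fast path */
  fp->lock_count++;                   /* stands in for _IO_acquire_lock */
  int result = mock_getc_unlocked (fp);
  /* unlock would go here, standing in for _IO_release_lock */
  return result;
}
```

On the fast path the wrapper reduces to one branch plus a tail call to the unlocked implementation, which is the whole point of the optimization.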
Comments
On Thursday 22 June 2017 03:20 PM, Szabolcs Nagy wrote:
> On 21/06/17 17:52, Joseph Myers wrote:
>> On Wed, 21 Jun 2017, Szabolcs Nagy wrote:
>>
>>> created. A new libc abi symbol, _IO_enable_locks, does this.
>>> (The abilist files will have to be updated in a separate patch).
>>
>> I don't see any declarations or uses of this symbol in public headers, so
>> why does it need to be at a public symbol version instead of
>> GLIBC_PRIVATE?
>
> PATCH v3: use GLIBC_PRIVATE for _IO_enable_locks.
> (should i use __libc_* naming convention for private symbols?)
>
> Locking overhead can be significant in some stdio operations
> that are common in single threaded applications.
>
> I'd like to address it in this release if possible as it
> causes performance problems to aarch64 users. I prefer this
> high-level approach to be reviewed, if that does not work then
> the aarch64 specific low-level approach will be taken.
>
> This patch adds the _IO_FLAGS2_NEED_LOCK flag to indicate if
> an _IO_FILE object needs to be locked and some of the stdio
> functions just jump to their _unlocked variant when not. The
> flag is set on all _IO_FILE objects when the first thread is
> created. A new GLIBC_PRIVATE libc symbol, _IO_enable_locks,
> was added to do this from libpthread.
>
> The optimization can be applied to more stdio functions,
> currently it is only applied to single flag check or single
> non-wide-char standard operations. The flag should probably
> be never set for files with _IO_USER_LOCK, but that's just a
> further optimization, not a correctness requirement.
>
> The optimization is valid in a single thread because stdio
> operations are non-as-safe (so lock state is not observable
> from a signal handler) and stdio locks are recursive (so lock
> state is not observable via deadlock). The optimization is not
> valid if a thread may be created while an stdio lock is taken
> and thus it should be disabled if any user code may run during
> an stdio operation (interposed malloc, printf hooks, etc).
> This makes the optimization more complicated for some stdio
> operations (e.g. printf), but those are bigger and thus less
> important to optimize so this patch does not try to do that.
>
> 2017-06-22  Szabolcs Nagy  <szabolcs.nagy@arm.com>
>
> 	* libio/libio.h (_IO_FLAGS2_NEED_LOCK, _IO_need_lock): Define.
> 	* libio/libioP.h (_IO_enable_locks): Declare.
> 	* libio/Versions (_IO_enable_locks): New symbol.
> 	* libio/genops.c (_IO_enable_locks): Define.
> 	(_IO_old_init): Initialize flags2.
> 	* libio/feof.c (_IO_feof): Avoid locking when not needed.
> 	* libio/ferror.c (_IO_ferror): Likewise.
> 	* libio/fputc.c (fputc): Likewise.
> 	* libio/putc.c (_IO_putc): Likewise.
> 	* libio/getc.c (_IO_getc): Likewise.
> 	* libio/getchar.c (getchar): Likewise.
> 	* libio/ioungetc.c (_IO_ungetc): Likewise.
> 	* nptl/pthread_create.c (__pthread_create_2_1): Enable stdio locks.
> 	* libio/iofopncook.c (_IO_fopencookie): Enable locking for the file.
> 	* sysdeps/pthread/flockfile.c (__flockfile): Likewise.
>
> stdio.diff
>
> diff --git a/libio/Versions b/libio/Versions
> index 2a1d6e6c85..6c1624e8cd 100644
> --- a/libio/Versions
> +++ b/libio/Versions
> @@ -155,5 +155,6 @@ libc {
>    GLIBC_PRIVATE {
>      # Used by NPTL and librt
>      __libc_fatal;
> +    _IO_enable_locks;
>    }
>  }
> diff --git a/libio/feof.c b/libio/feof.c
> index 9712a81e78..8890a5f51f 100644
> --- a/libio/feof.c
> +++ b/libio/feof.c
> @@ -32,6 +32,8 @@ _IO_feof (_IO_FILE *fp)
>  {
>    int result;
>    CHECK_FILE (fp, EOF);
> +  if (!_IO_need_lock (fp))
> +    return _IO_feof_unlocked (fp);
>    _IO_flockfile (fp);
>    result = _IO_feof_unlocked (fp);
>    _IO_funlockfile (fp);
> diff --git a/libio/ferror.c b/libio/ferror.c
> index 01e3bd8e2b..d10fcd9fff 100644
> --- a/libio/ferror.c
> +++ b/libio/ferror.c
> @@ -32,6 +32,8 @@ _IO_ferror (_IO_FILE *fp)
>  {
>    int result;
>    CHECK_FILE (fp, EOF);
> +  if (!_IO_need_lock (fp))
> +    return _IO_ferror_unlocked (fp);
>    _IO_flockfile (fp);
>    result = _IO_ferror_unlocked (fp);

The patch looks OK except for the duplication (and a missing comment below), which looks a bit clumsy.  How about something like this instead:

  bool need_lock = _IO_need_lock (fp);

  if (need_lock)
    _IO_flockfile (fp);
  result = _IO_ferror_unlocked (fp);
  if (need_lock)
    _IO_funlockfile (fp);

  return result;

You could probably make some kind of a macro out of this, I haven't looked that hard.
>    _IO_funlockfile (fp);
> diff --git a/libio/fputc.c b/libio/fputc.c
> index a7cd682fe2..b72305c06f 100644
> --- a/libio/fputc.c
> +++ b/libio/fputc.c
> @@ -32,6 +32,8 @@ fputc (int c, _IO_FILE *fp)
>  {
>    int result;
>    CHECK_FILE (fp, EOF);
> +  if (!_IO_need_lock (fp))
> +    return _IO_putc_unlocked (c, fp);
>    _IO_acquire_lock (fp);
>    result = _IO_putc_unlocked (c, fp);
>    _IO_release_lock (fp);
> diff --git a/libio/genops.c b/libio/genops.c
> index a466cfa337..132f1f1a7d 100644
> --- a/libio/genops.c
> +++ b/libio/genops.c
> @@ -570,11 +570,28 @@ _IO_init (_IO_FILE *fp, int flags)
>    _IO_init_internal (fp, flags);
>  }
>

Add a fat comment here describing in detail what you're doing here and why.

> +static int stdio_needs_locking;
> +
> +void
> +_IO_enable_locks (void)
> +{
> +  _IO_ITER i;
> +
> +  if (stdio_needs_locking)
> +    return;
> +  stdio_needs_locking = 1;
> +  for (i = _IO_iter_begin (); i != _IO_iter_end (); i = _IO_iter_next (i))
> +    _IO_iter_file (i)->_flags2 |= _IO_FLAGS2_NEED_LOCK;
> +}
> +libc_hidden_def (_IO_enable_locks)
> +
>  void
>  _IO_old_init (_IO_FILE *fp, int flags)
>  {
>    fp->_flags = _IO_MAGIC|flags;
>    fp->_flags2 = 0;
> +  if (stdio_needs_locking)
> +    fp->_flags2 |= _IO_FLAGS2_NEED_LOCK;
>    fp->_IO_buf_base = NULL;
>    fp->_IO_buf_end = NULL;
>    fp->_IO_read_base = NULL;
> diff --git a/libio/getc.c b/libio/getc.c
> index b58fd62308..fd66ef93cf 100644
> --- a/libio/getc.c
> +++ b/libio/getc.c
> @@ -34,6 +34,8 @@ _IO_getc (FILE *fp)
>  {
>    int result;
>    CHECK_FILE (fp, EOF);
> +  if (!_IO_need_lock (fp))
> +    return _IO_getc_unlocked (fp);
>    _IO_acquire_lock (fp);
>    result = _IO_getc_unlocked (fp);
>    _IO_release_lock (fp);
> diff --git a/libio/getchar.c b/libio/getchar.c
> index 5b41595d17..d79932114e 100644
> --- a/libio/getchar.c
> +++ b/libio/getchar.c
> @@ -33,6 +33,8 @@ int
>  getchar (void)
>  {
>    int result;
> +  if (!_IO_need_lock (_IO_stdin))
> +    return _IO_getc_unlocked (_IO_stdin);
>    _IO_acquire_lock (_IO_stdin);
>    result = _IO_getc_unlocked (_IO_stdin);
>    _IO_release_lock (_IO_stdin);
> diff --git a/libio/iofopncook.c b/libio/iofopncook.c
> index a08dfdaa42..982f464a68 100644
> --- a/libio/iofopncook.c
> +++ b/libio/iofopncook.c
> @@ -172,6 +172,8 @@ _IO_cookie_init (struct _IO_cookie_file *cfile, int read_write,
>    _IO_mask_flags (&cfile->__fp.file, read_write,
> 		  _IO_NO_READS+_IO_NO_WRITES+_IO_IS_APPENDING);
>
> +  cfile->__fp.file._flags2 |= _IO_FLAGS2_NEED_LOCK;
> +
>    /* We use a negative number different from -1 for _fileno to mark that
>       this special stream is not associated with a real file, but still has
>       to be treated as such.  */
> diff --git a/libio/ioungetc.c b/libio/ioungetc.c
> index 951064fa12..917cad8abb 100644
> --- a/libio/ioungetc.c
> +++ b/libio/ioungetc.c
> @@ -33,6 +33,8 @@ _IO_ungetc (int c, _IO_FILE *fp)
>    CHECK_FILE (fp, EOF);
>    if (c == EOF)
>      return EOF;
> +  if (!_IO_need_lock (fp))
> +    return _IO_sputbackc (fp, (unsigned char) c);
>    _IO_acquire_lock (fp);
>    result = _IO_sputbackc (fp, (unsigned char) c);
>    _IO_release_lock (fp);
> diff --git a/libio/libio.h b/libio/libio.h
> index 518ffd8e44..14bcb92332 100644
> --- a/libio/libio.h
> +++ b/libio/libio.h
> @@ -119,6 +119,7 @@
>  # define _IO_FLAGS2_SCANF_STD 16
>  # define _IO_FLAGS2_NOCLOSE 32
>  # define _IO_FLAGS2_CLOEXEC 64
> +# define _IO_FLAGS2_NEED_LOCK 128
>  #endif
>
>  /* These are "formatting flags" matching the iostream fmtflags enum values.  */
> @@ -451,6 +452,9 @@ extern int _IO_ftrylockfile (_IO_FILE *) __THROW;
>  #define _IO_cleanup_region_end(_Doit) /**/
>  #endif
>
> +#define _IO_need_lock(_fp) \
> +  (((_fp)->_flags2 & _IO_FLAGS2_NEED_LOCK) != 0)
> +
>  extern int _IO_vfscanf (_IO_FILE * __restrict, const char * __restrict,
> 			_IO_va_list, int *__restrict);
>  extern int _IO_vfprintf (_IO_FILE *__restrict, const char *__restrict,
> diff --git a/libio/libioP.h b/libio/libioP.h
> index eb93418c8d..163dfb1386 100644
> --- a/libio/libioP.h
> +++ b/libio/libioP.h
> @@ -444,6 +444,8 @@ extern void _IO_list_unlock (void) __THROW;
>  libc_hidden_proto (_IO_list_unlock)
>  extern void _IO_list_resetlock (void) __THROW;
>  libc_hidden_proto (_IO_list_resetlock)
> +extern void _IO_enable_locks (void) __THROW;
> +libc_hidden_proto (_IO_enable_locks)
>
>  /* Default jumptable functions. */
>
> @@ -750,7 +752,7 @@ extern _IO_off64_t _IO_seekpos_unlocked (_IO_FILE *, _IO_off64_t, int)
>
>  #if _G_HAVE_MMAP
>
> -# ifdef _LIBC
> +# if IS_IN (libc)
>  /* When using this code in the GNU libc we must not pollute the name space.  */
>  #  define mmap __mmap
>  #  define munmap __munmap
> diff --git a/libio/putc.c b/libio/putc.c
> index b591c5538b..6e1fdeef3a 100644
> --- a/libio/putc.c
> +++ b/libio/putc.c
> @@ -25,6 +25,8 @@ _IO_putc (int c, _IO_FILE *fp)
>  {
>    int result;
>    CHECK_FILE (fp, EOF);
> +  if (!_IO_need_lock (fp))
> +    return _IO_putc_unlocked (c, fp);
>    _IO_acquire_lock (fp);
>    result = _IO_putc_unlocked (c, fp);
>    _IO_release_lock (fp);
> diff --git a/nptl/pthread_create.c b/nptl/pthread_create.c
> index c7d1b8f413..0b3fa942f2 100644
> --- a/nptl/pthread_create.c
> +++ b/nptl/pthread_create.c
> @@ -32,6 +32,7 @@
>  #include <exit-thread.h>
>  #include <default-sched.h>
>  #include <futex-internal.h>
> +#include "libioP.h"
>
>  #include <shlib-compat.h>
>
> @@ -756,6 +757,9 @@ __pthread_create_2_1 (pthread_t *newthread, const pthread_attr_t *attr,
>        collect_default_sched (pd);
>      }
>
> +  if (__glibc_unlikely (__nptl_nthreads == 1))
> +    _IO_enable_locks ();
> +
>    /* Pass the descriptor to the caller.  */
>    *newthread = (pthread_t) pd;
>
> diff --git a/sysdeps/pthread/flockfile.c b/sysdeps/pthread/flockfile.c
> index 7fe8e99161..a8e6c28ed9 100644
> --- a/sysdeps/pthread/flockfile.c
> +++ b/sysdeps/pthread/flockfile.c
> @@ -25,6 +25,7 @@
>  void
>  __flockfile (FILE *stream)
>  {
> +  stream->_flags2 |= _IO_FLAGS2_NEED_LOCK;
>    _IO_lock_lock (*stream->_lock);
>  }
>  strong_alias (__flockfile, _IO_flockfile)
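For what it is worth, the macro Siddhesh alludes to might look something like the sketch below. The mock stream type and lock helpers here are illustrative assumptions, not the real libio API:

```c
struct mock_file
{
  int need_lock;
  int locks_taken;   /* counts lock acquisitions, for illustration only */
};

static void mock_flockfile (struct mock_file *fp)   { fp->locks_taken++; }
static void mock_funlockfile (struct mock_file *fp) { (void) fp; }

/* Hypothetical macro in the shape Siddhesh suggests: evaluate the
   unlocked operation, taking the stream lock only when the stream's
   flag says a lock is needed.  */
#define MOCK_WITH_OPTIONAL_LOCK(fp, result, expr)  \
  do                                               \
    {                                              \
      int need_tmp = (fp)->need_lock;              \
      if (need_tmp)                                \
        mock_flockfile (fp);                       \
      (result) = (expr);                           \
      if (need_tmp)                                \
        mock_funlockfile (fp);                     \
    }                                              \
  while (0)

/* A ferror-style wrapper rewritten with the macro; the expression 0
   stands in for _IO_ferror_unlocked (fp).  */
static int
mock_ferror (struct mock_file *fp)
{
  int result;
  MOCK_WITH_OPTIONAL_LOCK (fp, result, 0);
  return result;
}
```

This removes the duplicated lock/unlock boilerplate from each wrapper, at the cost of the two-branch shape Szabolcs objects to later in the thread.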
On Thursday 29 June 2017 05:11 PM, Siddhesh Poyarekar wrote:
> The patch looks OK except for the duplication (and a missing comment
> below), which looks a bit clumsy.  How about something like this instead:
>
>   bool need_lock = _IO_need_lock (fp);
>
>   if (need_lock)
>     _IO_flockfile (fp);
>   result = _IO_ferror_unlocked (fp);
>   if (need_lock)
>     _IO_funlockfile (fp);
>
>   return result;
>
> You could probably make some kind of a macro out of this, I haven't
> looked that hard.

I forgot that Torvald had commented (off-list, the thread broke somehow) that it would be important to try and measure how much worse this makes the multi-threaded case.

Siddhesh
On 06/29/2017 08:01 AM, Siddhesh Poyarekar wrote:
> On Thursday 29 June 2017 05:11 PM, Siddhesh Poyarekar wrote:
>> The patch looks OK except for the duplication (and a missing comment
>> below), which looks a bit clumsy.  How about something like this instead:
>>
>>   bool need_lock = _IO_need_lock (fp);
>>
>>   if (need_lock)
>>     _IO_flockfile (fp);
>>   result = _IO_ferror_unlocked (fp);
>>   if (need_lock)
>>     _IO_funlockfile (fp);
>>
>>   return result;
>>
>> You could probably make some kind of a macro out of this, I haven't
>> looked that hard.
>
> I forgot that Torvald had commented (off-list, the thread broke somehow)
> that it would be important to try and measure how much worse this makes
> the multi-threaded case.

+1

If we are going to optimize the single threaded case we need to know what impact this has on the multi-threaded case.
On 29/06/17 12:41, Siddhesh Poyarekar wrote:
>> @@ -32,6 +32,8 @@ _IO_ferror (_IO_FILE *fp)
>>  {
>>    int result;
>>    CHECK_FILE (fp, EOF);
>> +  if (!_IO_need_lock (fp))
>> +    return _IO_ferror_unlocked (fp);
>>    _IO_flockfile (fp);
>>    result = _IO_ferror_unlocked (fp);
>
> The patch looks OK except for the duplication (and a missing comment
> below), which looks a bit clumsy.  How about something like this instead:
>
>   bool need_lock = _IO_need_lock (fp);
>
>   if (need_lock)
>     _IO_flockfile (fp);
>   result = _IO_ferror_unlocked (fp);
>   if (need_lock)
>     _IO_funlockfile (fp);
>
>   return result;
>
> You could probably make some kind of a macro out of this, I haven't
> looked that hard.

this does not do what we want: a single check plus a tail call to the unlocked version.

with your code gcc is more likely to do two checks, register spills and an additional return.

while my code does not force gcc to do the right thing, at least it shows the intent.
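The two code shapes being compared can be put side by side. Whether the first actually compiles to a tail call depends on the compiler and target; the sketch only illustrates the structural difference Szabolcs describes, and all names here are mock stand-ins rather than libio internals:

```c
struct mock_file { int need_lock; int locks; };

static int  mock_op_unlocked (struct mock_file *fp) { (void) fp; return 0; }
static void mock_lock (struct mock_file *fp)        { fp->locks++; }
static void mock_unlock (struct mock_file *fp)      { (void) fp; }

/* Patch shape: a single test, then a tail call to the unlocked variant;
   lock handling is only reached on the slow path.  */
static int
op_patch_shape (struct mock_file *fp)
{
  if (!fp->need_lock)
    return mock_op_unlocked (fp);   /* candidate for a tail call */
  mock_lock (fp);
  int result = mock_op_unlocked (fp);
  mock_unlock (fp);
  return result;
}

/* Reviewed shape: the flag is cached and tested twice, and both paths
   fall through to a common return, so no tail call is possible.  */
static int
op_review_shape (struct mock_file *fp)
{
  int need = fp->need_lock;
  if (need)
    mock_lock (fp);
  int result = mock_op_unlocked (fp);
  if (need)
    mock_unlock (fp);
  return result;
}
```

Both variants are behaviorally identical; the disagreement is purely about the code the compiler is likely to generate for the fast path.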
On 29/06/17 13:12, Carlos O'Donell wrote:
> On 06/29/2017 08:01 AM, Siddhesh Poyarekar wrote:
>> On Thursday 29 June 2017 05:11 PM, Siddhesh Poyarekar wrote:
>>> The patch looks OK except for the duplication (and a missing comment
>>> below), which looks a bit clumsy.  How about something like this instead:
>>>
>>>   bool need_lock = _IO_need_lock (fp);
>>>
>>>   if (need_lock)
>>>     _IO_flockfile (fp);
>>>   result = _IO_ferror_unlocked (fp);
>>>   if (need_lock)
>>>     _IO_funlockfile (fp);
>>>
>>>   return result;
>>>
>>> You could probably make some kind of a macro out of this, I haven't
>>> looked that hard.
>>
>> I forgot that Torvald had commented (off-list, the thread broke somehow)
>> that it would be important to try and measure how much worse this makes
>> the multi-threaded case.
>
> +1
>
> If we are going to optimize the single threaded case we need to know what
> impact this has on the multi-threaded case.

note that this impacts the multi-threaded case less than the lowlevellock approach that is currently implemented: that adds two checks, my code does one; that loads __libc_multiple_threads twice, mine checks a flag in fp, which is in a cache line that is most likely already accessed by the rest of the io code.

i cannot produce numbers immediately: the last time i measured this, adding a dummy thread via an ld-preloaded lib had more effect on the timing of the same binary than adding a branch in the stdio code. (i'm not sure why the additional thread affects timing so much with the current code; it might be a cpu issue, e.g. cache aliasing caused by a slightly different layout of the loaded libs, but it also shows that the effect of the patch is small.)
On Thursday 29 June 2017 06:59 PM, Szabolcs Nagy wrote:
> this does not do what we want:
> single check + tailcall the unlocked version.
>
> with your code gcc is more likely to do two
> checks and register spills and an additional
> return.
>
> while my code does not force gcc to do the
> right thing at least it shows the intent.

Can you verify this both ways?  If they don't show any noticeable difference then I prefer the way I suggested, but that is really just for aesthetics.  Maybe also provide a __glibc_unlikely hint to aid the compiler.

Siddhesh
On 29/06/17 13:12, Carlos O'Donell wrote:
> On 06/29/2017 08:01 AM, Siddhesh Poyarekar wrote:
>> On Thursday 29 June 2017 05:11 PM, Siddhesh Poyarekar wrote:
>>> The patch looks OK except for the duplication (and a missing comment
>>> below), which looks a bit clumsy.  How about something like this instead:
>>>
>>>   bool need_lock = _IO_need_lock (fp);
>>>
>>>   if (need_lock)
>>>     _IO_flockfile (fp);
>>>   result = _IO_ferror_unlocked (fp);
>>>   if (need_lock)
>>>     _IO_funlockfile (fp);
>>>
>>>   return result;
>>>
>>> You could probably make some kind of a macro out of this, I haven't
>>> looked that hard.
>>
>> I forgot that Torvald had commented (off-list, the thread broke somehow)
>> that it would be important to try and measure how much worse this makes
>> the multi-threaded case.
>
> +1
>
> If we are going to optimize the single threaded case we need to know what
> impact this has on the multi-threaded case.

$orig == current
$stdio == my patch
$stdio_mt == my patch, but the 'needs lock' flag is set so the
multithreaded path is taken

on two particular aarch64 cpus with a particular loop count:

cpu1

time $orig/lib64/ld-2.25.90.so --library-path $orig/lib64 ./getchar
8.08user 0.04system 0:08.12elapsed 100%CPU (0avgtext+0avgdata 1472maxresident)k
0inputs+0outputs (0major+40minor)pagefaults 0swaps
time $stdio/lib64/ld-2.25.90.so --library-path $stdio/lib64 ./getchar
1.07user 0.04system 0:01.11elapsed 99%CPU (0avgtext+0avgdata 1472maxresident)k
0inputs+0outputs (0major+40minor)pagefaults 0swaps
time $stdio_mt/lib64/ld-2.25.90.so --library-path $stdio_mt/lib64 ./getchar
7.87user 0.00system 0:07.88elapsed 99%CPU (0avgtext+0avgdata 1472maxresident)k
0inputs+0outputs (0major+40minor)pagefaults 0swaps

cpu2

time $orig/lib64/ld-2.25.90.so --library-path $orig/lib64 ./getchar
8.11user 0.04system 0:08.16elapsed 99%CPU (0avgtext+0avgdata 1472maxresident)k
0inputs+0outputs (0major+40minor)pagefaults 0swaps
time $stdio/lib64/ld-2.25.90.so --library-path $stdio/lib64 ./getchar
2.29user 0.06system 0:02.35elapsed 99%CPU (0avgtext+0avgdata 1472maxresident)k
0inputs+0outputs (0major+40minor)pagefaults 0swaps
time $stdio_mt/lib64/ld-2.25.90.so --library-path $stdio_mt/lib64 ./getchar
8.12user 0.03system 0:08.16elapsed 99%CPU (0avgtext+0avgdata 1472maxresident)k
0inputs+0outputs (0major+40minor)pagefaults 0swaps

on a particular x86_64 cpu with particular loop count:

time $orig/lib64/ld-2.25.90.so --library-path $orig/lib64 ./getchar
5.89user 0.07system 0:05.98elapsed 99%CPU (0avgtext+0avgdata 2000maxresident)k
0inputs+0outputs (0major+153minor)pagefaults 0swaps
time $stdio/lib64/ld-2.25.90.so --library-path $stdio/lib64 ./getchar
2.66user 0.06system 0:02.73elapsed 99%CPU (0avgtext+0avgdata 2032maxresident)k
0inputs+0outputs (0major+155minor)pagefaults 0swaps
time $stdio_mt/lib64/ld-2.25.90.so --library-path $stdio_mt/lib64 ./getchar
6.76user 0.08system 0:06.87elapsed 99%CPU (0avgtext+0avgdata 2032maxresident)k
0inputs+0outputs (0major+155minor)pagefaults 0swaps

in summary: on aarch64 i see no regression (in some cases stdio_mt
became faster, which can happen since the layout of the code changed);
on this particular x86 cpu stdio_mt has a close to 15% regression.

i don't believe the big regression on x86 is valid, it could be that
the benchmark just got past some cpu internal limit or the code got
aligned differently; in fact, if i statically link the exact same code
then on the same cpu i get

time ./getchar_static-orig
6.60user 0.05system 0:06.66elapsed 99%CPU (0avgtext+0avgdata 912maxresident)k
0inputs+0outputs (0major+81minor)pagefaults 0swaps
time ./getchar_static-stdio
2.24user 0.08system 0:02.33elapsed 99%CPU (0avgtext+0avgdata 896maxresident)k
0inputs+0outputs (0major+81minor)pagefaults 0swaps
time ./getchar_static-stdio_mt
6.50user 0.06system 0:06.57elapsed 99%CPU (0avgtext+0avgdata 896maxresident)k
0inputs+0outputs (0major+81minor)pagefaults 0swaps

i.e. now it is faster from the branch!
(both measurements are repeatable)

i didn't dig into the root cause of the regression (or why is static
linking slower?), i would not be too worried about it since the common
case for hot stdio loops is in single thread processes where even on
x86 the patch gives >2x speedup.
On 29/06/17 15:06, Siddhesh Poyarekar wrote:
> On Thursday 29 June 2017 06:59 PM, Szabolcs Nagy wrote:
>> this does not do what we want:
>> single check + tailcall the unlocked version.
>>
>> with your code gcc is more likely to do two
>> checks and register spills and an additional
>> return.
>>
>> while my code does not force gcc to do the
>> right thing at least it shows the intent.
>
> Can you verify this both ways?  If they don't show any noticeable
> difference then I prefer the way I suggested, but that is really just
> for aesthetics.  Maybe also provide a __glibc_unlikely hint to aid the
> compiler.

your code layout does not work for most
io apis as they use _IO_acquire_lock/unlock
which create a new scope for cancel cleanup
handling.

it only works with non-cancellation point
apis (flag checks: feof, ferror) that just
flockfile/funlockfile.

i can change those two functions where it
works but i don't think it will be any more
maintainable.
On 06/30/2017 08:15 AM, Szabolcs Nagy wrote:
> i didn't dig into the root cause of the regression (or
> why is static linking slower?), i would not be too
> worried about it since the common case for hot stdio
> loops is in single thread processes where even on x86
> the patch gives >2x speedup.

Regardless of the cause, the 15% regression on x86 MT performance
is serious, and I see no reason to push this into glibc 2.26.
We can add it any time in 2.27, or the distros can pick it up with
a backport.

I would like to see a better characterization of the regression before
accepting this patch.

While I agree that the common case for hot stdio loops is non-MT, there
are still MT cases, and 15% is a large double-digit loss.

Have you looked at the assembly differences?  What is the compiler
doing differently?

When a user asks "Why is my MT stdio 15% slower?" we owe them an
answer that is clear and concise.
On 30/06/2017 10:16, Carlos O'Donell wrote:
> On 06/30/2017 08:15 AM, Szabolcs Nagy wrote:
>> i didn't dig into the root cause of the regression (or
>> why is static linking slower?), i would not be too
>> worried about it since the common case for hot stdio
>> loops is in single thread processes where even on x86
>> the patch gives >2x speedup.
>
> Regardless of the cause, the 15% regression on x86 MT performance
> is serious, and I see no reason to push this into glibc 2.26.
> We can add it any time in 2.27, or the distros can pick it up with
> a backport.
>
> I would like to see a better characterization of the regression before
> accepting this patch.
>
> While I agree that the common case for hot stdio loops is non-MT, there
> are still MT cases, and 15% is a large double-digit loss.
>
> Have you looked at the assembly differences?  What is the compiler
> doing differently?
>
> When a user asks "Why is my MT stdio 15% slower?" we owe them an
> answer that is clear and concise.

Just a wild guess, but might some of the overhead be due to the fact
that on x86_64 the internal locks will now unnecessarily check for the
single-thread case (the __lll_lock_asm_start [1] snippet)?  Did you try
the simple variant, without the check for __libc_multiple_threads?

[1] ./sysdeps/unix/sysv/linux/x86_64/lowlevellock.h
On 30/06/17 14:16, Carlos O'Donell wrote:
> On 06/30/2017 08:15 AM, Szabolcs Nagy wrote:
>> i didn't dig into the root cause of the regression (or
>> why is static linking slower?), i would not be too
>> worried about it since the common case for hot stdio
>> loops is in single thread processes where even on x86
>> the patch gives >2x speedup.
>
> Regardless of the cause, the 15% regression on x86 MT performance
> is serious, and I see no reason to push this into glibc 2.26.
> We can add it any time in 2.27, or the distros can pick it up with
> a backport.
>
> I would like to see a better characterization of the regression before
> accepting this patch.
>
> While I agree that the common case for hot stdio loops is non-MT, there
> are still MT cases, and 15% is a large double-digit loss.
>
> Have you looked at the assembly differences?  What is the compiler
> doing differently?
>
> When a user asks "Why is my MT stdio 15% slower?" we owe them an
> answer that is clear and concise.

sorry, the x86 measurement was bogus: only the high level code thought
it was multithreaded, the lowlevellock code thought it was single
threaded, so there were no atomic ops executed in the stdio_mt case.

with atomics the orig performance is significantly slower, so the
regression relative to that is small in %.

if i create a dummy thread (to measure true mt behaviour, same loop
count):

time $orig/lib64/ld-2.25.90.so --library-path $orig/lib64 ./getchar_mt
20.31user 0.11system 0:20.47elapsed 99%CPU (0avgtext+0avgdata 2416maxresident)k
0inputs+0outputs (0major+180minor)pagefaults 0swaps
time $stdio/lib64/ld-2.25.90.so --library-path $stdio/lib64 ./getchar_mt
20.72user 0.03system 0:20.79elapsed 99%CPU (0avgtext+0avgdata 2400maxresident)k
0inputs+0outputs (0major+179minor)pagefaults 0swaps

the relative diff is 2% now, but notice that the abs diff went down too
(which points to a uarch issue in the previous measurement).

perf stat indicates that there are 15 vs 16 branches in the loop (so my
patch indeed adds one branch, but there are plenty of branches already);
the instruction count goes from 43 to 45 per loop iteration (flag check
+ branch).

how +1 branch could decrease the performance by >10% in my previous
measurements, when there are already >10 branches (and several other
insns), is something the x86 uarchitects could explain.

in summary, the patch trades 2% mt performance for 2x non-mt
performance on this x86 cpu.
On Friday 30 June 2017 06:08 PM, Szabolcs Nagy wrote:
> your code layout does not work for most
> io apis as they use _IO_acquire_lock/unlock
> which create a new scope for cancel cleanup
> handling.
>
> it only works with non-cancellation point
> apis (flag checks: feof, ferror) that just
> flockfile/funlockfile.
>
> i can change those two functions where it
> works but i don't think it will be any more
> maintainable.

Right, that won't work.  There should be a cleaner way to do both, but I
can't think of one right now, so I'll let that go.

So the only open points right now are the comment I asked you to add
describing what/why you're doing this for stdio, and the performance
question on x86.  Carlos, etc. who are monitoring this will tell you
when they're satisfied with your updated results.

Siddhesh
On Fri, 2017-06-30 at 13:15 +0100, Szabolcs Nagy wrote:
> on a particular x86_64 cpu with particular loop count:
>
> time $orig/lib64/ld-2.25.90.so --library-path $orig/lib64 ./getchar
> 5.89user 0.07system 0:05.98elapsed 99%CPU (0avgtext+0avgdata 2000maxresident)k
> 0inputs+0outputs (0major+153minor)pagefaults 0swaps
> time $stdio/lib64/ld-2.25.90.so --library-path $stdio/lib64 ./getchar
> 2.66user 0.06system 0:02.73elapsed 99%CPU (0avgtext+0avgdata 2032maxresident)k
> 0inputs+0outputs (0major+155minor)pagefaults 0swaps

What's interesting here is that your high-level optimization is faster
than doing the single-thread check in the low-level lock (x86 has it
already in the low-level lock).
On Friday 30 June 2017 08:48 PM, Torvald Riegel wrote:
> What's interesting here is that your high-level optimization is faster
> than doing the single-thread check in the low-level lock (x86 has it
> already in the low-level lock).

Cache locality might be a factor - the fp struct may already be in cache
as opposed to the TLS block from which one gets the value of
multiple_threads.

Siddhesh
On Fri, Jun 30, 2017 at 4:18 PM, Torvald Riegel <triegel@redhat.com> wrote:
> On Fri, 2017-06-30 at 13:15 +0100, Szabolcs Nagy wrote:
>> on a particular x86_64 cpu with particular loop count:
>>
>> time $orig/lib64/ld-2.25.90.so --library-path $orig/lib64 ./getchar
>> 5.89user 0.07system 0:05.98elapsed 99%CPU (0avgtext+0avgdata 2000maxresident)k
>> 0inputs+0outputs (0major+153minor)pagefaults 0swaps
>> time $stdio/lib64/ld-2.25.90.so --library-path $stdio/lib64 ./getchar
>> 2.66user 0.06system 0:02.73elapsed 99%CPU (0avgtext+0avgdata 2032maxresident)k
>> 0inputs+0outputs (0major+155minor)pagefaults 0swaps
>
> What's interesting here is that your high-level optimization is faster
> than doing the single-thread check in the low-level lock (x86 has it
> already in the low-level lock).

Isn't that the expected behaviour here?

Ramana
On Fri, 2017-06-30 at 20:54 +0530, Siddhesh Poyarekar wrote:
> On Friday 30 June 2017 08:48 PM, Torvald Riegel wrote:
>> What's interesting here is that your high-level optimization is faster
>> than doing the single-thread check in the low-level lock (x86 has it
>> already in the low-level lock).
>
> Cache locality might be a factor - the fp struct may already be in cache
> as opposed to the TLS block from which one gets the value of
> multiple_threads.

And the check can happen earlier in the code, and at a higher level.
On 30/06/17 16:18, Torvald Riegel wrote:
> On Fri, 2017-06-30 at 13:15 +0100, Szabolcs Nagy wrote:
>> on a particular x86_64 cpu with particular loop count:
>>
>> time $orig/lib64/ld-2.25.90.so --library-path $orig/lib64 ./getchar
>> 5.89user 0.07system 0:05.98elapsed 99%CPU (0avgtext+0avgdata 2000maxresident)k
>> 0inputs+0outputs (0major+153minor)pagefaults 0swaps
>> time $stdio/lib64/ld-2.25.90.so --library-path $stdio/lib64 ./getchar
>> 2.66user 0.06system 0:02.73elapsed 99%CPU (0avgtext+0avgdata 2032maxresident)k
>> 0inputs+0outputs (0major+155minor)pagefaults 0swaps
>
> What's interesting here is that your high-level optimization is faster
> than doing the single-thread check in the low-level lock (x86 has it
> already in the low-level lock).

musl has implemented stdio this way since 2011, which is the reason for
the 2x difference in getc/putc (i386):
http://www.etalabs.net/compare_libcs.html

we knew this already, but pushing a high level opt in glibc is harder
than doing a target specific lll opt.

musl does not have to worry about fopencookie, libc internal malloc
interposition, printf hooks, magic IO flags, or target specific hacks,
so the optimization is more obviously correct there, but now that the
patch only omits locks in specific functions the correctness should be
easier to verify.
On 06/30/2017 10:39 AM, Szabolcs Nagy wrote:
> On 30/06/17 14:16, Carlos O'Donell wrote:
>> On 06/30/2017 08:15 AM, Szabolcs Nagy wrote:
>>> i didn't dig into the root cause of the regression (or
>>> why is static linking slower?), i would not be too
>>> worried about it since the common case for hot stdio
>>> loops is in single thread processes where even on x86
>>> the patch gives >2x speedup.
>>
>> Regardless of the cause, the 15% regression on x86 MT performance
>> is serious, and I see no reason to push this into glibc 2.26.
>> We can add it any time in 2.27, or the distros can pick it up with
>> a backport.
>>
>> I would like to see a better characterization of the regression before
>> accepting this patch.
>>
>> While I agree that the common case for hot stdio loops is non-MT, there
>> are still MT cases, and 15% is a large double-digit loss.
>>
>> Have you looked at the assembly differences?  What is the compiler
>> doing differently?
>>
>> When a user asks "Why is my MT stdio 15% slower?" we owe them an
>> answer that is clear and concise.
>
> sorry, the x86 measurement was bogus: only the high level code thought
> it was multithreaded, the lowlevellock code thought it was single
> threaded, so there were no atomic ops executed in the stdio_mt case.

OK.

> with atomics the orig performance is significantly slower, so the
> regression relative to that is small in %.
>
> if i create a dummy thread (to measure true mt behaviour, same loop
> count):
>
> time $orig/lib64/ld-2.25.90.so --library-path $orig/lib64 ./getchar_mt
> 20.31user 0.11system 0:20.47elapsed 99%CPU (0avgtext+0avgdata 2416maxresident)k
> 0inputs+0outputs (0major+180minor)pagefaults 0swaps
> time $stdio/lib64/ld-2.25.90.so --library-path $stdio/lib64 ./getchar_mt
> 20.72user 0.03system 0:20.79elapsed 99%CPU (0avgtext+0avgdata 2400maxresident)k
> 0inputs+0outputs (0major+179minor)pagefaults 0swaps
>
> the relative diff is 2% now, but notice that the abs diff went down too
> (which points to a uarch issue in the previous measurement).

OK.  This is much better.

> perf stat indicates that there are 15 vs 16 branches in the loop (so my
> patch indeed adds one branch, but there are plenty of branches already);
> the instruction count goes from 43 to 45 per loop iteration (flag check
> + branch).
>
> in summary, the patch trades 2% mt performance for 2x non-mt
> performance on this x86 cpu.

Excellent, this is exactly the analysis I was looking for, and this kind
of result is something that can make sense to our users.

I'm OK with the patch for 2.26.
diff --git a/libio/Versions b/libio/Versions
index 2a1d6e6c85..6c1624e8cd 100644
--- a/libio/Versions
+++ b/libio/Versions
@@ -155,5 +155,6 @@ libc {
   GLIBC_PRIVATE {
     # Used by NPTL and librt
     __libc_fatal;
+    _IO_enable_locks;
   }
 }
diff --git a/libio/feof.c b/libio/feof.c
index 9712a81e78..8890a5f51f 100644
--- a/libio/feof.c
+++ b/libio/feof.c
@@ -32,6 +32,8 @@ _IO_feof (_IO_FILE *fp)
 {
   int result;
   CHECK_FILE (fp, EOF);
+  if (!_IO_need_lock (fp))
+    return _IO_feof_unlocked (fp);
   _IO_flockfile (fp);
   result = _IO_feof_unlocked (fp);
   _IO_funlockfile (fp);
diff --git a/libio/ferror.c b/libio/ferror.c
index 01e3bd8e2b..d10fcd9fff 100644
--- a/libio/ferror.c
+++ b/libio/ferror.c
@@ -32,6 +32,8 @@ _IO_ferror (_IO_FILE *fp)
 {
   int result;
   CHECK_FILE (fp, EOF);
+  if (!_IO_need_lock (fp))
+    return _IO_ferror_unlocked (fp);
   _IO_flockfile (fp);
   result = _IO_ferror_unlocked (fp);
   _IO_funlockfile (fp);
diff --git a/libio/fputc.c b/libio/fputc.c
index a7cd682fe2..b72305c06f 100644
--- a/libio/fputc.c
+++ b/libio/fputc.c
@@ -32,6 +32,8 @@ fputc (int c, _IO_FILE *fp)
 {
   int result;
   CHECK_FILE (fp, EOF);
+  if (!_IO_need_lock (fp))
+    return _IO_putc_unlocked (c, fp);
   _IO_acquire_lock (fp);
   result = _IO_putc_unlocked (c, fp);
   _IO_release_lock (fp);
diff --git a/libio/genops.c b/libio/genops.c
index a466cfa337..132f1f1a7d 100644
--- a/libio/genops.c
+++ b/libio/genops.c
@@ -570,11 +570,28 @@ _IO_init (_IO_FILE *fp, int flags)
   _IO_init_internal (fp, flags);
 }
 
+static int stdio_needs_locking;
+
+void
+_IO_enable_locks (void)
+{
+  _IO_ITER i;
+
+  if (stdio_needs_locking)
+    return;
+  stdio_needs_locking = 1;
+  for (i = _IO_iter_begin (); i != _IO_iter_end (); i = _IO_iter_next (i))
+    _IO_iter_file (i)->_flags2 |= _IO_FLAGS2_NEED_LOCK;
+}
+libc_hidden_def (_IO_enable_locks)
+
 void
 _IO_old_init (_IO_FILE *fp, int flags)
 {
   fp->_flags = _IO_MAGIC|flags;
   fp->_flags2 = 0;
+  if (stdio_needs_locking)
+    fp->_flags2 |= _IO_FLAGS2_NEED_LOCK;
   fp->_IO_buf_base = NULL;
   fp->_IO_buf_end = NULL;
   fp->_IO_read_base = NULL;
diff --git a/libio/getc.c b/libio/getc.c
index b58fd62308..fd66ef93cf 100644
--- a/libio/getc.c
+++ b/libio/getc.c
@@ -34,6 +34,8 @@ _IO_getc (FILE *fp)
 {
   int result;
   CHECK_FILE (fp, EOF);
+  if (!_IO_need_lock (fp))
+    return _IO_getc_unlocked (fp);
   _IO_acquire_lock (fp);
   result = _IO_getc_unlocked (fp);
   _IO_release_lock (fp);
diff --git a/libio/getchar.c b/libio/getchar.c
index 5b41595d17..d79932114e 100644
--- a/libio/getchar.c
+++ b/libio/getchar.c
@@ -33,6 +33,8 @@ int
 getchar (void)
 {
   int result;
+  if (!_IO_need_lock (_IO_stdin))
+    return _IO_getc_unlocked (_IO_stdin);
   _IO_acquire_lock (_IO_stdin);
   result = _IO_getc_unlocked (_IO_stdin);
   _IO_release_lock (_IO_stdin);
diff --git a/libio/iofopncook.c b/libio/iofopncook.c
index a08dfdaa42..982f464a68 100644
--- a/libio/iofopncook.c
+++ b/libio/iofopncook.c
@@ -172,6 +172,8 @@ _IO_cookie_init (struct _IO_cookie_file *cfile, int read_write,
   _IO_mask_flags (&cfile->__fp.file, read_write,
 		  _IO_NO_READS+_IO_NO_WRITES+_IO_IS_APPENDING);
 
+  cfile->__fp.file._flags2 |= _IO_FLAGS2_NEED_LOCK;
+
   /* We use a negative number different from -1 for _fileno to mark that
      this special stream is not associated with a real file, but still has
      to be treated as such.  */
diff --git a/libio/ioungetc.c b/libio/ioungetc.c
index 951064fa12..917cad8abb 100644
--- a/libio/ioungetc.c
+++ b/libio/ioungetc.c
@@ -33,6 +33,8 @@ _IO_ungetc (int c, _IO_FILE *fp)
   CHECK_FILE (fp, EOF);
   if (c == EOF)
     return EOF;
+  if (!_IO_need_lock (fp))
+    return _IO_sputbackc (fp, (unsigned char) c);
   _IO_acquire_lock (fp);
   result = _IO_sputbackc (fp, (unsigned char) c);
   _IO_release_lock (fp);
diff --git a/libio/libio.h b/libio/libio.h
index 518ffd8e44..14bcb92332 100644
--- a/libio/libio.h
+++ b/libio/libio.h
@@ -119,6 +119,7 @@
 # define _IO_FLAGS2_SCANF_STD 16
 # define _IO_FLAGS2_NOCLOSE 32
 # define _IO_FLAGS2_CLOEXEC 64
+# define _IO_FLAGS2_NEED_LOCK 128
 #endif
 
 /* These are "formatting flags" matching the iostream fmtflags enum values.
    */
@@ -451,6 +452,9 @@ extern int _IO_ftrylockfile (_IO_FILE *) __THROW;
 #define _IO_cleanup_region_end(_Doit) /**/
 #endif
 
+#define _IO_need_lock(_fp) \
+  (((_fp)->_flags2 & _IO_FLAGS2_NEED_LOCK) != 0)
+
 extern int _IO_vfscanf (_IO_FILE * __restrict, const char * __restrict,
 			_IO_va_list, int *__restrict);
 extern int _IO_vfprintf (_IO_FILE *__restrict, const char *__restrict,
diff --git a/libio/libioP.h b/libio/libioP.h
index eb93418c8d..163dfb1386 100644
--- a/libio/libioP.h
+++ b/libio/libioP.h
@@ -444,6 +444,8 @@ extern void _IO_list_unlock (void) __THROW;
 libc_hidden_proto (_IO_list_unlock)
 extern void _IO_list_resetlock (void) __THROW;
 libc_hidden_proto (_IO_list_resetlock)
+extern void _IO_enable_locks (void) __THROW;
+libc_hidden_proto (_IO_enable_locks)
 
 /* Default jumptable functions. */
@@ -750,7 +752,7 @@ extern _IO_off64_t _IO_seekpos_unlocked (_IO_FILE *, _IO_off64_t, int)
 
 #if _G_HAVE_MMAP
 
-# ifdef _LIBC
+# if IS_IN (libc)
 /* When using this code in the GNU libc we must not pollute the name
    space.  */
 #  define mmap __mmap
 #  define munmap __munmap
diff --git a/libio/putc.c b/libio/putc.c
index b591c5538b..6e1fdeef3a 100644
--- a/libio/putc.c
+++ b/libio/putc.c
@@ -25,6 +25,8 @@ _IO_putc (int c, _IO_FILE *fp)
 {
   int result;
   CHECK_FILE (fp, EOF);
+  if (!_IO_need_lock (fp))
+    return _IO_putc_unlocked (c, fp);
   _IO_acquire_lock (fp);
   result = _IO_putc_unlocked (c, fp);
   _IO_release_lock (fp);
diff --git a/nptl/pthread_create.c b/nptl/pthread_create.c
index c7d1b8f413..0b3fa942f2 100644
--- a/nptl/pthread_create.c
+++ b/nptl/pthread_create.c
@@ -32,6 +32,7 @@
 #include <exit-thread.h>
 #include <default-sched.h>
 #include <futex-internal.h>
+#include "libioP.h"
 
 #include <shlib-compat.h>
 
@@ -756,6 +757,9 @@ __pthread_create_2_1 (pthread_t *newthread, const pthread_attr_t *attr,
       collect_default_sched (pd);
     }
 
+  if (__glibc_unlikely (__nptl_nthreads == 1))
+    _IO_enable_locks ();
+
   /* Pass the descriptor to the caller.  */
   *newthread = (pthread_t) pd;
diff --git a/sysdeps/pthread/flockfile.c b/sysdeps/pthread/flockfile.c
index 7fe8e99161..a8e6c28ed9 100644
--- a/sysdeps/pthread/flockfile.c
+++ b/sysdeps/pthread/flockfile.c
@@ -25,6 +25,7 @@
 void
 __flockfile (FILE *stream)
 {
+  stream->_flags2 |= _IO_FLAGS2_NEED_LOCK;
   _IO_lock_lock (*stream->_lock);
 }
 strong_alias (__flockfile, _IO_flockfile)