From patchwork Mon Nov 27 17:55:58 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Andrew Burgess <aburgess@redhat.com>
X-Patchwork-Id: 80837
Return-Path: <gdb-patches-bounces+patchwork=sourceware.org@sourceware.org>
X-Original-To: patchwork@sourceware.org
Delivered-To: patchwork@sourceware.org
Received: from server2.sourceware.org (localhost [IPv6:::1])
	by sourceware.org (Postfix) with ESMTP id D9555385AC1F
	for <patchwork@sourceware.org>; Mon, 27 Nov 2023 17:57:10 +0000 (GMT)
X-Original-To: gdb-patches@sourceware.org
Delivered-To: gdb-patches@sourceware.org
Received: from us-smtp-delivery-124.mimecast.com
 (us-smtp-delivery-124.mimecast.com [170.10.133.124])
 by sourceware.org (Postfix) with ESMTPS id F35BD3857723
 for <gdb-patches@sourceware.org>; Mon, 27 Nov 2023 17:56:14 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org F35BD3857723
Authentication-Results: sourceware.org;
 dmarc=pass (p=none dis=none) header.from=redhat.com
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=redhat.com
ARC-Filter: OpenARC Filter v1.0.0 sourceware.org F35BD3857723
Authentication-Results: server2.sourceware.org;
 arc=none smtp.remote-ip=170.10.133.124
ARC-Seal: i=1; a=rsa-sha256; d=sourceware.org; s=key; t=1701107776; cv=none;
 b=M8ZaSvfNL+OJNfyPWXsQYBc83Wjc0zSNYTAn671zN0yy/dXxY+GA8C5ychGAfJizO7tWPyDW7A5HBUhoCZJi4Q8ngFRxWHYFytAH+NY5NKxD0YXdMoM6eIXD4P93OGN7z+mCU64QTzv5ycx05jhC0nmcSbHEi0cKBr5ctjodsJ8=
ARC-Message-Signature: i=1; a=rsa-sha256; d=sourceware.org; s=key;
 t=1701107776; c=relaxed/simple;
 bh=bKChCrVcymm1sOWCjGJ1HTTmLY6mIS4Lv4fvBmfbU4M=;
 h=DKIM-Signature:From:To:Subject:Date:Message-Id:MIME-Version;
 b=fS6xmizEe9Lm7ytN10mOhfL0buNX3P0KlWCZOFU+KF6tV4tJxKm1suQ7kD6epihurZiL8DLWSz9oW+X+It8ta4JVsaHVjoPSum6qx3kaEQ0M2zonK4tgUai1HBX6Qdf9MVH9UCreYro6s5gaoy+JOWwwzUkXZcYrVkD+Oold1Go=
ARC-Authentication-Results: i=1; server2.sourceware.org
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com;
 s=mimecast20190719; t=1701107774;
 h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
 to:to:cc:cc:mime-version:mime-version:content-type:content-type:
 content-transfer-encoding:content-transfer-encoding:
 in-reply-to:in-reply-to:references:references;
 bh=hzeGzuUfHE39DY3rsd/4OEiW4XPM/BR06Iv2zxiK+OM=;
 b=daKy8zWCsaWCPQHSBMVLLmElXOdqjnrXctgceij3m7y53Rcxbsy9L17iP6ZICS+sLeKU9e
 fyWTX+scXLIgbwzeQ60c8zoRH+LY1tQEMmtv67fRD9G4FFEMZfwiPP8sS6PPQ5HZQ8Q3gH
 yiuDiXO9MtkbuPeJi7lBrDN7Fis3ERA=
Received: from mail-wm1-f71.google.com (mail-wm1-f71.google.com
 [209.85.128.71]) by relay.mimecast.com with ESMTP with STARTTLS
 (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id
 us-mta-110-3BIajJfNMoK1aGBpsL8xcg-1; Mon, 27 Nov 2023 12:56:10 -0500
X-MC-Unique: 3BIajJfNMoK1aGBpsL8xcg-1
Received: by mail-wm1-f71.google.com with SMTP id
 5b1f17b1804b1-40b3d81399dso15570885e9.1
 for <gdb-patches@sourceware.org>; Mon, 27 Nov 2023 09:56:10 -0800 (PST)
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20230601; t=1701107769; x=1701712569;
 h=content-transfer-encoding:mime-version:references:in-reply-to
 :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc
 :subject:date:message-id:reply-to;
 bh=hzeGzuUfHE39DY3rsd/4OEiW4XPM/BR06Iv2zxiK+OM=;
 b=MOrg9xAQRjNnwV6svfDM6coBOj5aWLju7v2Rkj1X3DEzVGriZvYW8kquXLwNOxNSJH
 CwilvbEUGGHdYTT3g1U+0HMFxOUiIDREPcedCVVWa3XVpMYU5THfiZrveWyLpMRpyBlt
 nv4JWyrTm16VvWwBqxpwVViYhgLgoUzh/deWmUCZ/q7QiXMmkoeQEB75PfB9Ye9BLNnp
 LnLFTpyOIB7Q7/s5aQSHHDXzsxX2bNS8fLUWxSOOTzNzl0Ve+E1WcVOFDXlew3B9FWp8
 M8xlTLkSdS2e9G61Y263/O7mZ8WEe4dMB98xnrR6bvvByOJqC250Zx18BZuxCaXbcSBK
 c4qQ==
X-Gm-Message-State: AOJu0YzN37GE1BeMliB5NtzPy1xClHl7lyeGv59wB088b7GVJvu2Si4P
 wVPw0rE6BL3ZiM5uVR6BhBwxkcOprYEigJwEtY2Ci1GKVa9Dvw5fgEBFv9HzVilcStUkYqJbMJH
 RDl+rGwz+YrNPrI4fBM11YeiACA5C7XlqRvTc3BYC6i9Qg349Mh+DGN3Cr7nI7I3KJkH4DxFhB5
 /rDEcChw==
X-Received: by 2002:a05:600c:1912:b0:40b:47f3:589f with SMTP id
 j18-20020a05600c191200b0040b47f3589fmr2154095wmq.37.1701107769051;
 Mon, 27 Nov 2023 09:56:09 -0800 (PST)
X-Google-Smtp-Source: 
 AGHT+IFd+/SHIlqYKjWUTGmWwuk8qGivF5ZQ8YsfujBdpKy9Xze3RVzoTlcGakDjAuHnLksA1tilQg==
X-Received: by 2002:a05:600c:1912:b0:40b:47f3:589f with SMTP id
 j18-20020a05600c191200b0040b47f3589fmr2154080wmq.37.1701107768631;
 Mon, 27 Nov 2023 09:56:08 -0800 (PST)
Received: from localhost (105.226.159.143.dyn.plus.net. [143.159.226.105])
 by smtp.gmail.com with ESMTPSA id
 l38-20020a05600c1d2600b0040b45282f88sm5265373wms.36.2023.11.27.09.56.07
 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
 Mon, 27 Nov 2023 09:56:08 -0800 (PST)
From: Andrew Burgess <aburgess@redhat.com>
To: gdb-patches@sourceware.org
Cc: Andrew Burgess <aburgess@redhat.com>
Subject: [PATCH 4/7] gdb: reduce size of generated gdb-index file
Date: Mon, 27 Nov 2023 17:55:58 +0000
Message-Id: 
 <7700d3eccfdbd6105da398309ea8898fb658781b.1701107594.git.aburgess@redhat.com>
X-Mailer: git-send-email 2.25.4
In-Reply-To: <cover.1701107594.git.aburgess@redhat.com>
References: <cover.1701107594.git.aburgess@redhat.com>
MIME-Version: 1.0
X-Mimecast-Spam-Score: 0
X-Mimecast-Originator: redhat.com
X-Spam-Status: No, score=-11.0 required=5.0 tests=BAYES_00, DKIMWL_WL_HIGH,
 DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, GIT_PATCH_0,
 KAM_SHORT,
 KAM_STOCKGEN, RCVD_IN_DNSWL_NONE, RCVD_IN_MSPIKE_H3, RCVD_IN_MSPIKE_WL,
 SPF_HELO_NONE, SPF_NONE, TXREP,
 T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on
 server2.sourceware.org
X-BeenThere: gdb-patches@sourceware.org
X-Mailman-Version: 2.1.30
Precedence: list
List-Id: Gdb-patches mailing list <gdb-patches.sourceware.org>
List-Unsubscribe: <https://sourceware.org/mailman/options/gdb-patches>,
 <mailto:gdb-patches-request@sourceware.org?subject=unsubscribe>
List-Archive: <https://sourceware.org/pipermail/gdb-patches/>
List-Post: <mailto:gdb-patches@sourceware.org>
List-Help: <mailto:gdb-patches-request@sourceware.org?subject=help>
List-Subscribe: <https://sourceware.org/mailman/listinfo/gdb-patches>,
 <mailto:gdb-patches-request@sourceware.org?subject=subscribe>
Errors-To: gdb-patches-bounces+patchwork=sourceware.org@sourceware.org

I noticed in passing that our algorithm for generating the gdb-index
file is incorrect.  When building the hash table in add_index_entry we
count every incoming entry, and rehash when the number of entries gets
too large.  However, some of the incoming entries will be duplicates,
which don't actually result in new items being added to the hash
table.

As a result, we grow the gdb-index hash table far too often.

With an unmodified GDB, generating a gdb-index for GDB, I see a file
size of 90M, with a hash usage (in the generated index file) of just
2.6%.

With a patched GDB, generating a gdb-index for the _same_ GDB binary,
I now see a gdb-index file size of 30M, with a hash usage of 41.9%.

This is a 67% reduction in gdb-index file size.

Obviously, not every gdb-index file is going to see such big savings;
however, the larger a program, and the more symbols that are
duplicated between compilation units, the more GDB would over-count,
and so over-grow the index.

The gdb-index hash table we create has a minimum size of 1024, and
we grow the hash when it is 75% full, doubling the hash table at
that time.  Given this, we expect that either:

  a. The hash table is size 1024, and less than 75% full, or
  b. The hash table is between 37.5% and 75% full.

I've included a test that checks some of these constraints -- I've not
bothered to check the upper limit, as an over-full hash table isn't
really a problem here, but if the fill percentage is less than 37.5%
then this indicates that we've done something wrong (obviously, I also
check for the 1024 minimum size).
---
 gdb/dwarf2/index-write.c             |  29 ++++---
 gdb/testsuite/gdb.gdb/index-file.exp | 115 +++++++++++++++++++++++++++
 2 files changed, 134 insertions(+), 10 deletions(-)
 create mode 100644 gdb/testsuite/gdb.gdb/index-file.exp

diff --git a/gdb/dwarf2/index-write.c b/gdb/dwarf2/index-write.c
index c0867799f6d..5960bacf8fb 100644
--- a/gdb/dwarf2/index-write.c
+++ b/gdb/dwarf2/index-write.c
@@ -256,20 +256,29 @@ add_index_entry (struct mapped_symtab *symtab, const char *name,
 		 int is_static, gdb_index_symbol_kind kind,
 		 offset_type cu_index)
 {
-  offset_type cu_index_and_attrs;
+  symtab_index_entry *slot = &find_slot (symtab, name);
+  if (slot->name == NULL)
+    {
+      /* This is a new element in the hash table.  */
+      ++symtab->n_elements;
 
-  ++symtab->n_elements;
-  if (4 * symtab->n_elements / 3 >= symtab->data.size ())
-    hash_expand (symtab);
+      /* We might need to grow the hash table.  */
+      if (4 * symtab->n_elements / 3 >= symtab->data.size ())
+	{
+	  hash_expand (symtab);
 
-  symtab_index_entry &slot = find_slot (symtab, name);
-  if (slot.name == NULL)
-    {
-      slot.name = name;
+	  /* This element will have a different slot in the new table.  */
+	  slot = &find_slot (symtab, name);
+
+	  /* But it should still be a new element in the hash table.  */
+	  gdb_assert (slot->name == nullptr);
+	}
+
+      slot->name = name;
       /* index_offset is set later.  */
     }
 
-  cu_index_and_attrs = 0;
+  offset_type cu_index_and_attrs = 0;
   DW2_GDB_INDEX_CU_SET_VALUE (cu_index_and_attrs, cu_index);
   DW2_GDB_INDEX_SYMBOL_STATIC_SET_VALUE (cu_index_and_attrs, is_static);
   DW2_GDB_INDEX_SYMBOL_KIND_SET_VALUE (cu_index_and_attrs, kind);
@@ -281,7 +290,7 @@ add_index_entry (struct mapped_symtab *symtab, const char *name,
      the last entry pushed), but a symbol could have multiple kinds in one CU.
      To keep things simple we don't worry about the duplication here and
      sort and uniquify the list after we've processed all symbols.  */
-  slot.cu_indices.push_back (cu_index_and_attrs);
+  slot->cu_indices.push_back (cu_index_and_attrs);
 }
 
 /* See symtab_index_entry.  */
diff --git a/gdb/testsuite/gdb.gdb/index-file.exp b/gdb/testsuite/gdb.gdb/index-file.exp
new file mode 100644
index 00000000000..c6edd286fb9
--- /dev/null
+++ b/gdb/testsuite/gdb.gdb/index-file.exp
@@ -0,0 +1,115 @@
+# Copyright 2023 Free Software Foundation, Inc.
+
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 3 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+
+# Load the GDB executable, and then 'save gdb-index', and make some
+# checks of the generated index file.
+
+load_lib selftest-support.exp
+
+# Can't save an index with readnow.
+require !readnow
+
+# A multiplier used to ensure slow tasks are less likely to timeout.
+set timeout_factor 20
+
+set filename [selftest_prepare]
+if { $filename eq "" } {
+    unsupported "${gdb_test_file_name}.exp"
+    return -1
+}
+
+with_timeout_factor $timeout_factor {
+    # Start GDB, load FILENAME.
+    clean_restart $filename
+}
+
+# Generate an index file.
+set dir1 [standard_output_file "index_1"]
+remote_exec host "mkdir -p ${dir1}"
+with_timeout_factor $timeout_factor {
+    gdb_test_no_output "save gdb-index $dir1" \
+	"create gdb-index file"
+}
+
+# Close GDB.
+gdb_exit
+
+# Validate that the index-file FILENAME has made efficient use of its
+# symbol hash table.  Calculate the number of symbols in the hash
+# table and the total hash table size.  The hash table starts with
+# 1024 entries, and then doubles each time it is filled to 75%.  At
+# 75% filled, doubling the size takes it to 37.5% filled.
+#
+# Thus, the hash table is correctly filled if:
+#  1. Its size is 1024 (i.e. it has not yet had its first doubling), or
+#  2. Its filled percentage is over 37%
+#
+# We could check that it is not over filled, but I don't as that's not
+# really an issue.  But we did once have a bug where the table was
+# doubled incorrectly, in which case we'd see a filled percentage of
+# around 2% in some cases, which is a huge waste of disk space.
+proc check_symbol_table_usage { filename } {
+    # Open the file in binary mode and read-only mode.
+    set fp [open $filename rb]
+
+    # Configure the channel to use binary translation.
+    fconfigure $fp -translation binary
+
+    # Read the first 28 bytes of the file (seven 4-byte fields), which
+    # contain the header of the index section.
+    set header [read $fp [expr 7 * 4]]
+
+    # Scan the header to get the version, the symbol table offset, and
+    # the shortcut table offset.  The three fields between are ignored.
+    binary scan $header iiiiii version \
+	_ _ _ symbol_table_offset shortcut_offset
+
+    # The length of the symbol hash table (in entries).
+    set len [expr ($shortcut_offset - $symbol_table_offset) / 8]
+
+    # Now walk the hash table and count how many entries are in use.
+    set offset $symbol_table_offset
+    set count 0
+    while { $offset < $shortcut_offset } {
+	seek $fp $offset
+	set entry [read $fp 8]
+	binary scan $entry ii name_ptr flags
+	if { $name_ptr != 0 } {
+	    incr count
+	}
+
+	incr offset 8
+    }
+
+    # Close the file.
+    close $fp
+
+    # Calculate how full the hash table is.
+    set pct [expr (100 * double($count)) / $len]
+
+    # Write our results out to the gdb.log.
+    verbose -log "Hash table size: $len"
+    verbose -log "Hash table entries: $count"
+    verbose -log "Percentage usage: $pct%"
+
+    # The minimum fill percentage is actually 37.5%, but we give Tcl a
+    # little flexibility in case the FP maths give a result a little
+    # off.
+    gdb_assert { $len == 1024 || $pct > 37 } \
+	"symbol hash table usage"
+}
+
+set index_filename_base [file tail $filename]
+check_symbol_table_usage "$dir1/${index_filename_base}.gdb-index"