From patchwork Mon Dec  4 04:34:00 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Jeff Law <jeffreyalaw@gmail.com>
X-Patchwork-Id: 81245
Return-Path: <gdb-patches-bounces+patchwork=sourceware.org@sourceware.org>
X-Original-To: patchwork@sourceware.org
Delivered-To: patchwork@sourceware.org
Received: from server2.sourceware.org (localhost [IPv6:::1])
	by sourceware.org (Postfix) with ESMTP id 5027D385803B
	for <patchwork@sourceware.org>; Mon,  4 Dec 2023 04:34:25 +0000 (GMT)
X-Original-To: gdb-patches@sourceware.org
Delivered-To: gdb-patches@sourceware.org
Received: from mail-ot1-x32e.google.com (mail-ot1-x32e.google.com
 [IPv6:2607:f8b0:4864:20::32e])
 by sourceware.org (Postfix) with ESMTPS id BACA03858C42
 for <gdb-patches@sourceware.org>; Mon,  4 Dec 2023 04:34:06 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org BACA03858C42
Authentication-Results: sourceware.org;
 dmarc=pass (p=none dis=none) header.from=gmail.com
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=gmail.com
ARC-Filter: OpenARC Filter v1.0.0 sourceware.org BACA03858C42
Authentication-Results: server2.sourceware.org;
 arc=none smtp.remote-ip=2607:f8b0:4864:20::32e
ARC-Seal: i=1; a=rsa-sha256; d=sourceware.org; s=key; t=1701664453; cv=none;
 b=rzsASnQuNTdg+WN/ohwRA1onJC94lf0YIjeAp7t4mlH77u9Cyvygd0S0tFOFuf2xwXA5zehho5btPMDpfdcfi9S8LnmViZtw2ZPVlHewuvKI73TTxovOb1wFyNtg+JXbSX5pTZCJIXJf+Oa9aKFaoSV0z5lwnhyoUQbOkZEG7Hg=
ARC-Message-Signature: i=1; a=rsa-sha256; d=sourceware.org; s=key;
 t=1701664453; c=relaxed/simple;
 bh=29gz33iXyE8OHeXah+Bq7JhDMaOGn7+5QKwkTMDZ03E=;
 h=DKIM-Signature:Message-ID:Date:MIME-Version:From:To:Subject;
 b=G3hvnSQzRBeyIodfyx5XTqH5oqYlh+2/D/Ar6qIWUkCdMigDm2NJK+5+1rliudN99JR04C21VoqQZC6WE5RxQqirHj+3esckisOTQ4i5/QF7rVy29E8rM0MxQsUN6oCxfmeaAK+PD72K14bvnmeX34uwSBZQcIpqcKz7ugxhtcc=
ARC-Authentication-Results: i=1; server2.sourceware.org
Received: by mail-ot1-x32e.google.com with SMTP id
 46e09a7af769-6d7fa93afe9so2000459a34.2
 for <gdb-patches@sourceware.org>; Sun, 03 Dec 2023 20:34:06 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=gmail.com; s=20230601; t=1701664445; x=1702269245; darn=sourceware.org;
 h=subject:to:content-language:from:user-agent:mime-version:date
 :message-id:from:to:cc:subject:date:message-id:reply-to;
 bh=D686oQw42DgyAD5spWA4/LOj0ZP3IbGybl6mietMXQE=;
 b=iPFDY+R36+Ok9XJKtLkkwlIQlvIxY8kD5bp3bJrx3B3D0X2IQZlzSMucZEKDkTSg0x
 NoCqFNoQ8tQbBgprX/oPE0/MPKEsDHjL9ROJKOWzEDcNbD/Fxl8FI3YRHlbwc037Tvop
 ewUamNZp645GgX3NFwkwqtZA7WNgoU0ckDavhJW1Bab0BTTlkL2e7KvpCJFMrJvMabtL
 0RO2kEnyLuPnlQAN1Qgt3ylfMyFsAIBB5nWZGVKHGqUTkeVokQiT/wVicWNt2yOSByjg
 KMWdOpcw9Xvmi3Y3yNbpYJHTKzi8zucivmVPjjnJaKwnmSwaIb2Ei9d2wV4ZSxCB2hpP
 55jg==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20230601; t=1701664445; x=1702269245;
 h=subject:to:content-language:from:user-agent:mime-version:date
 :message-id:x-gm-message-state:from:to:cc:subject:date:message-id
 :reply-to;
 bh=D686oQw42DgyAD5spWA4/LOj0ZP3IbGybl6mietMXQE=;
 b=O2bvwh2ecsmoNIhl8AFWj5fvr5d/vy2NYu60vTIeH+pQ4KGAOKM0v/oT9lWDjfIfwx
 3i1PeeLf+5LVeKHrb94/zoeIuRuC0RTXrD+clqJ4+nwn/V2NtgVo63IGBEMzews2jJgQ
 IUQFennP4tSHDkBl8N5pk66JgYaR7WLe7esZZILQzYKfnqbwDrievKq5w/PClm9frVWv
 6/keFcgmFxcNKlNWkWr1Opved5x/XvY0icPHdaBaYe3iS1ZeDJbaUjnwMIKXMi0Pduuq
 dVYqxpG6vRM+vVd559NNenfudfT8W1VkJy74NzmdJZ4SBfEiFiwAqguwE6ukXuYczUuG
 WRQg==
X-Gm-Message-State: AOJu0YwFqjIf9x88VR04Z59hUw5bRuc8DUJNYSFudEjQd+w5G7jL2WMc
 B+GyIgFyiSXIN1B7c47LhoNf/EQyu94ynw==
X-Google-Smtp-Source: 
 AGHT+IEyQyV3SLTeIXi8Lpymejv3wKx9SeJROdeOyX55m/5tMlvrjYWe6ChJplOiG7zUmLAsdGukeA==
X-Received: by 2002:a05:6870:6b07:b0:1fb:2e1:bf6a with SMTP id
 mt7-20020a0568706b0700b001fb02e1bf6amr3636833oab.5.1701664444898;
 Sun, 03 Dec 2023 20:34:04 -0800 (PST)
Received: from [172.31.0.109] ([136.36.130.248])
 by smtp.gmail.com with ESMTPSA id
 k28-20020a635a5c000000b00528db73ed70sm6705322pgm.3.2023.12.03.20.34.02
 for <gdb-patches@sourceware.org>
 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128);
 Sun, 03 Dec 2023 20:34:04 -0800 (PST)
Message-ID: <41f2930c-60cd-4992-baec-f8bfee1439e4@gmail.com>
Date: Sun, 3 Dec 2023 21:34:00 -0700
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
From: Jeff Law <jeffreyalaw@gmail.com>
Content-Language: en-US
To: gdb-patches@sourceware.org
Subject: Improve performance of the H8 simulator
X-Spam-Status: No, score=-8.3 required=5.0 tests=BAYES_00, DKIM_SIGNED,
 DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, FREEMAIL_FROM, GIT_PATCH_0,
 RCVD_IN_DNSWL_NONE, SPF_HELO_NONE, SPF_PASS, TXREP,
 T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on
 server2.sourceware.org
X-BeenThere: gdb-patches@sourceware.org
X-Mailman-Version: 2.1.30
Precedence: list
List-Id: Gdb-patches mailing list <gdb-patches.sourceware.org>
List-Unsubscribe: <https://sourceware.org/mailman/options/gdb-patches>,
 <mailto:gdb-patches-request@sourceware.org?subject=unsubscribe>
List-Archive: <https://sourceware.org/pipermail/gdb-patches/>
List-Post: <mailto:gdb-patches@sourceware.org>
List-Help: <mailto:gdb-patches-request@sourceware.org?subject=help>
List-Subscribe: <https://sourceware.org/mailman/listinfo/gdb-patches>,
 <mailto:gdb-patches-request@sourceware.org?subject=subscribe>
Errors-To: gdb-patches-bounces+patchwork=sourceware.org@sourceware.org

Running the H8 port through the GCC testsuite currently takes 4h 30m on 
my fastest server -- that's roughly 1.5hrs per multilib tested and many 
tests are disabled for various reasons.

To put that 1.5hr/multilib in perspective, that's roughly 3X the time 
for other embedded targets.  Clearly something isn't working as well as 
it should.

A bit of digging with perf shows that we're spending a crazy amount of 
time decoding instructions in the H8 simulator.  It's not hard to see 
why -- basically we take a blob of instruction data, then try to match 
it to every instruction in the H8 opcode table starting at the 
beginning.  That table has ~8000 entries (each different addressing mode 
is considered a different instruction in the table).


Naturally my first thought was to sort the table and use a binary search 
to find the right entry.  That's made excessively complex due to the 
encoding on the H8.  Just getting the sort right would be much more 
complex than I'd consider advisable.

Another thought was to build a mapping to the right entry for all the 
instructions that can be disambiguated based on the first nibble (4 
bits) of instruction data and a mapping for those which can be 
disambiguated based on the first byte of instruction data.

That seemed feasible until I realized that the H8/SX did some truly 
horrid things with encoding branches in the 0x4XYY opcode space.  It 
uses an "always zero" bit in the offset to encode new semantic 
information.  So we can't select on just 0x4X.  Ugh!

We could always to a custom decoder.  I've done several through the 
years, they can be very fast.  But no way I can justify the time to do that.

So what I settled on was to first sort the opcode table by the first 
nibble, then find the index of the first instruction for each nibble. 
Decoding uses that index to start its search.  This cuts the overall 
build/test by more than half.

Next I adjusted the sort so that instructions that are not available on 
the current sub architecture are put at the end of the table.   This 
shaves another ~15% off the total cycle time.

The net of the two changes is on my fastest server we've gone from 4:30 
to 1:40 running the GCC testsuite.  Same test results before/after, of 
course.  It's still not fast, but it's a hell of a lot better.

OK for the trunk?

Thanks,
Jeff
* h8300/compile.c (nib_indices): New variable.
	(instruction_comparator): New function.
	(sort_opcodes_and_setup_nibble_indices): New function.
	(init_pointers): Call sort_opcodes_and_setup_nibble_indices.
	(decode): Use nib_indices to avoid some useless table searching.

diff --git a/sim/h8300/compile.c b/sim/h8300/compile.c
index a4b39ae3380..cc94e63c0fd 100644
--- a/sim/h8300/compile.c
+++ b/sim/h8300/compile.c
@@ -44,6 +44,11 @@
 
 int debug;
 
+/* Each entry in this array is an index into the main opcode
+   array for the first instruction starting with the given
+   4 bit nibble.  */
+static int nib_indices[16];
+
 static int memory_size;
 
 #define X(op, size)  (op * 4 + size)
@@ -388,14 +393,21 @@ decode (SIM_DESC sd, sim_cpu *cpu, int addr, unsigned char *data, decoded_inst *
   int reg[3]   = {0, 0, 0};
   int rdisp[3] = {0, 0, 0};
   int opnum;
+  int index;
   const struct h8_opcode *q;
 
   dst->dst.type = -1;
   dst->src.type = -1;
   dst->op3.type = -1;
 
-  /* Find the exact opcode/arg combo.  */
-  for (q = h8_opcodes; q->name; q++)
+  /* We speed up instruction decoding by caching an index into
+     the main opcode array for the first instruction with the
+     given 4 bit nibble.  */
+  index = nib_indices[(data[0] & 0xf0) >> 4];
+
+  /* Find the exact opcode/arg combo, starting with the precomputed
+     index.  Note this loop is performance sensitive.  */
+  for (q = &h8_opcodes[index]; q->name; q++)
     {
       const op_type *nib = q->data.nib;
       unsigned int len = 0;
@@ -1557,6 +1569,85 @@ store2 (SIM_DESC sd, ea_type *arg, int n)
   return store_1 (sd, arg, n, 1);
 }
 
+/* Callback for qsort.  We sort first based on availablity
+   (available instructions sort lower).  When availability state
+   is the same, then we use the first 4 bit nibble as a secondary
+   sort key.
+
+   We don't really care about 100% stability here, just that the
+   available instructions come first and all instrutions with
+   the same starting nibble are consecutive.
+
+   We could do even better by recording frequency information into the
+   main table and using that to sort within a nibble's group with the
+   highest frequency instructions appearing first.  */
+
+static int
+instruction_comparator (const void *p1_, const void *p2_)
+{
+  struct h8_opcode *p1 = (struct h8_opcode *)p1_;
+  struct h8_opcode *p2 = (struct h8_opcode *)p2_;
+
+  /* The 1st sort key is based on whether or not the
+     instruction is even available.  This reduces the
+     number of entries we have to look at in the common
+     case.  */
+  bool p1_available = !((p1->available == AV_H8SX && !h8300sxmode)
+			|| (p1->available == AV_H8S  && !h8300smode)
+			|| (p1->available == AV_H8H  && !h8300hmode));
+
+  bool p2_available = !((p2->available == AV_H8SX && !h8300sxmode)
+			|| (p2->available == AV_H8S  && !h8300smode)
+			|| (p2->available == AV_H8H  && !h8300hmode));
+
+  /* Sort so that available instructions come before unavailable
+     instructions.  */
+  if (p1_available != p2_available)
+    return p2_available - p1_available;
+
+  /* Secondarily sort based on the first opcode nibble.  */
+  return p1->data.nib[0] - p2->data.nib[0];
+}
+
+
+/* OPS is the opcode array, which is initially sorted by mnenomic.
+
+   Sort the array so that the instructions for the sub-architecture
+   are at the start and unavailable instructions are at the end.
+
+   Within the set of available instructions, further sort them based
+   on the first 4 bit nibble.
+
+   Then find the first index into OPS for each of the 16 possible
+   nibbles and record that into NIB_INDICES to speed up decoding.  */
+
+static void
+sort_opcodes_and_setup_nibble_indices (struct h8_opcode *ops)
+{
+  const struct h8_opcode *q;
+  int *indices;
+  int i;
+
+  /* First sort the OPS array.  */
+  for (i = 0, q = ops; q->name; q++, i++)
+    ;
+  qsort (ops, i, sizeof (struct h8_opcode), instruction_comparator);
+
+  /* Now walk the array caching the index of the first
+     occurrence of each 4 bit nibble.  */
+  memset (nib_indices, -1, sizeof (int) * 16);
+  for (i = 0, q = ops; q->name; q++, i++)
+    {
+      int nib = q->data.nib[0];
+
+      /* Record the location of the first entry with the right
+	 nibble count.  */
+      if (nib_indices[nib] == -1)
+	nib_indices[nib] = i;
+    }
+}
+
+
 /* Flag to be set whenever a new SIM_DESC object is created.  */
 static int init_pointers_needed = 1;
 
@@ -1639,6 +1730,9 @@ init_pointers (SIM_DESC sd)
 	  h8_set_reg (cpu, i, 0);
 	}
 
+      /* Sort the opcode table and create indices to speed up decode.  */
+      sort_opcodes_and_setup_nibble_indices (ops);
+
       init_pointers_needed = 0;
     }
 }