[2/2] x86: Add generic CPUID data dumper to ld.so --list-diagnostics

Message ID b9ec9217f3b7e6c02ed850f283b4c732c756a528.1694203757.git.fweimer@redhat.com
State Superseded
Series [1/2] elf: Wire up _dl_diagnostics_cpu_kernel

Checks

Context Check Description
redhat-pt-bot/TryBot-apply_patch success Patch applied to master at the time it was sent
linaro-tcwg-bot/tcwg_glibc_build--master-arm success Testing passed
redhat-pt-bot/TryBot-32bit success Build for i686
linaro-tcwg-bot/tcwg_glibc_build--master-aarch64 success Testing passed
linaro-tcwg-bot/tcwg_glibc_check--master-arm success Testing passed
linaro-tcwg-bot/tcwg_glibc_check--master-aarch64 success Testing passed

Commit Message

Florian Weimer Sept. 8, 2023, 8:10 p.m. UTC
This is surprisingly difficult to implement if the goal is to produce
reasonably sized output.  With the current approaches to output
compression (suppressing zeros and repeated results between CPUs,
folding ranges of identical subleaves, dealing with the %ecx
reflection issue), the output is less than 600 KiB even for systems
with 256 threads.

Tested on i686-linux-gnu and x86_64-linux-gnu.  Built with a fairly
broad build-many-glibcs.py subset (including both Hurd targets).

---
 manual/dynlink.texi                           |  86 +++-
 .../linux/x86/dl-diagnostics-cpu-kernel.c     | 457 ++++++++++++++++++
 2 files changed, 542 insertions(+), 1 deletion(-)
 create mode 100644 sysdeps/unix/sysv/linux/x86/dl-diagnostics-cpu-kernel.c
  

Comments

Noah Goldstein Sept. 10, 2023, 7:56 p.m. UTC | #1
On Fri, Sep 8, 2023 at 3:10 PM Florian Weimer <fweimer@redhat.com> wrote:
>
> This is surprisingly difficult to implement if the goal is to produce
> reasonably sized output.  With the current approaches to output
> compression (suppressing zeros and repeated results between CPUs,
> folding ranges of identical subleaves, dealing with the %ecx
> reflection issue), the output is less than 600 KiB even for systems
> with 256 threads.
>
Maybe it should just output a complete JSON document?
Then users can pretty easily write scripts to extract the exact information
they are after. Or the dumper can be extended in the future to let
the user specify fields/values to dump so it can be configured to be more
reasonable?

> Tested on i686-linux-gnu and x86_64-linux-gnu.  Built with a fairly
> broad build-many-glibcs.py subset (including both Hurd targets).
>
> ---
>  manual/dynlink.texi                           |  86 +++-
>  .../linux/x86/dl-diagnostics-cpu-kernel.c     | 457 ++++++++++++++++++
>  2 files changed, 542 insertions(+), 1 deletion(-)
>  create mode 100644 sysdeps/unix/sysv/linux/x86/dl-diagnostics-cpu-kernel.c
>
> diff --git a/manual/dynlink.texi b/manual/dynlink.texi
> index 06a6c15533..1f02124722 100644
> --- a/manual/dynlink.texi
> +++ b/manual/dynlink.texi
> @@ -228,7 +228,91 @@ reported by the @code{uname} function.  @xref{Platform Type}.
>  @item x86.cpu_features.@dots{}
>  These items are specific to the i386 and x86-64 architectures.  They
>  reflect supported CPU features and information on cache geometry, mostly
> -collected using the @code{CPUID} instruction.
> +collected using the CPUID instruction.
> +
> +@item x86.processor[@var{index}].@dots{}
> +These are additional items for the i386 and x86-64 architectures, as
> +described below.  They mostly contain raw data from the CPUID
> +instruction.  The probes are performed for each active CPU for the
> +@code{ld.so} process, and data for different probed CPUs receives a
> +unique @var{index} value.  Some CPUID data is expected to differ from CPU
> +core to CPU core.  In some cases, CPUs are not correctly initialized and
> +indicate the presence of different feature sets.
> +
> +@item x86.processor[@var{index}].requested=@var{kernel-cpu}
> +The kernel is told to run the subsequent probing on the CPU numbered
> +@var{kernel-cpu}.  The values @var{kernel-cpu} and @var{index} can be
> +distinct if there are gaps in the process CPU affinity mask.  This line
> +is not included if CPU affinity mask information is not available.
> +
> +@item x86.processor[@var{index}].observed=@var{kernel-cpu}
> +This line reports the kernel CPU number @var{kernel-cpu} on which the
> +probing code initially ran.  This line is only printed if the requested
> +and observed kernel CPU numbers differ.  This can happen if the kernel
> +fails to act on a request to change the process CPU affinity mask.
> +
> +@item x86.processor[@var{index}].observed_node=@var{node}
> +This reports the observed NUMA node number, as reported by the
> +@code{getcpu} system call.  It is missing if the @code{getcpu} system
> +call failed.
> +
> +@item x86.processor[@var{index}].cpuid_leaves=@var{count}
> +This line indicates that @var{count} distinct CPUID leaves were
> +encountered.  (This reflects internal @code{ld.so} storage space; it
> +does not directly correspond to @code{CPUID} enumeration ranges.)
> +
> +@item x86.processor[@var{index}].ecx_limit=@var{value}
> +The CPUID data extraction code uses a brute-force approach to enumerate
> +subleaves (see the @samp{.subleaf_eax} lines below).  The last
> +@code{%rcx} value used in a CPUID query on this probed CPU was
> +@var{value}.
> +
> +@item x86.processor[@var{index}].cpuid.eax[@var{query_eax}].eax=@var{eax}
> +@itemx x86.processor[@var{index}].cpuid.eax[@var{query_eax}].ebx=@var{ebx}
> +@itemx x86.processor[@var{index}].cpuid.eax[@var{query_eax}].ecx=@var{ecx}
> +@itemx x86.processor[@var{index}].cpuid.eax[@var{query_eax}].edx=@var{edx}
> +These lines report the register contents after executing the CPUID
> +instruction with @samp{%rax == @var{query_eax}} and @samp{%rcx == 0} (a
> +@dfn{leaf}).  For the first probed CPU (with a zero @var{index}), only
> +leaves with non-zero register contents are reported.  For subsequent
> +CPUs, only leaves whose register contents differ from the previously
> +probed CPU (with @var{index} one less) are reported.
> +
> +Basic and extended leaves are reported using the same syntax.  This
> +means there is a large jump in @var{query_eax} for the first reported
> +extended leaf.
> +
> +@item x86.processor[@var{index}].cpuid.subleaf_eax[@var{query_eax}].ecx[@var{query_ecx}].eax=@var{eax}
> +@itemx x86.processor[@var{index}].cpuid.subleaf_eax[@var{query_eax}].ecx[@var{query_ecx}].ebx=@var{ebx}
> +@itemx x86.processor[@var{index}].cpuid.subleaf_eax[@var{query_eax}].ecx[@var{query_ecx}].ecx=@var{ecx}
> +@itemx x86.processor[@var{index}].cpuid.subleaf_eax[@var{query_eax}].ecx[@var{query_ecx}].edx=@var{edx}
> +This is similar to the leaves above, but for a @dfn{subleaf}.  For
> +subleaves, the CPUID instruction is executed with @samp{%rax ==
> +@var{query_eax}} and @samp{%rcx == @var{query_ecx}}, so the result
> +depends on both register values.  The same rules about filtering zero
> +and identical results apply.
> +
> +@item x86.processor[@var{index}].cpuid.subleaf_eax[@var{query_eax}].ecx[@var{query_ecx}].until_ecx=@var{ecx_limit}
> +Some CPUID results are the same regardless of the @var{query_ecx} value.
> +If this situation is detected, a line with the @samp{.until_ecx}
> +selector is included, indicating that the CPUID register
> +contents are the same for @code{%rcx} values between @var{query_ecx}
> +and @var{ecx_limit} (inclusive).
> +
> +@item x86.processor[@var{index}].cpuid.subleaf_eax[@var{query_eax}].ecx[@var{query_ecx}].ecx_query_mask=0xff
> +This line indicates that in an @samp{.until_ecx} range, the CPUID
> +instruction preserved the lowest 8 bits of the input @code{%rcx} in
> +the output @code{%rcx} register.  Otherwise, the subleaves in the range
> +have identical values.  This special treatment is necessary to report
> +compact range information in case such copying occurs (because the
> +subleaves would otherwise be all different).
> +
> +@item x86.processor[@var{index}].xgetbv.ecx[@var{query_ecx}]=@var{result}
> +This line shows the 64-bit @var{result} value in the @code{%rdx:%rax}
> +register pair after executing the XGETBV instruction with @code{%rcx}
> +set to @var{query_ecx}.  Zero values and values matching the previously
> +probed CPU are omitted.  Nothing is printed if the system does not
> +support the XGETBV instruction.
>  @end table
>
>  @node Dynamic Linker Introspection
> diff --git a/sysdeps/unix/sysv/linux/x86/dl-diagnostics-cpu-kernel.c b/sysdeps/unix/sysv/linux/x86/dl-diagnostics-cpu-kernel.c
> new file mode 100644
> index 0000000000..f84331b33b
> --- /dev/null
> +++ b/sysdeps/unix/sysv/linux/x86/dl-diagnostics-cpu-kernel.c
> @@ -0,0 +1,457 @@
> +/* Print CPU/kernel diagnostics data in ld.so.  Version for x86.
> +   Copyright (C) 2023 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <https://www.gnu.org/licenses/>.  */
> +
> +#include <dl-diagnostics.h>
> +
> +#include <array_length.h>
> +#include <cpu-features.h>
> +#include <cpuid.h>
> +#include <ldsodefs.h>
> +#include <stdbool.h>
> +#include <string.h>
> +#include <sysdep.h>
> +
> +/* Register arguments to CPUID.  Multiple ECX subleaf values yielding
> +   the same result are combined, to shorten the output.  Both
> +   identical matches (EAX to EDX are the same) and matches where the
> +   registers agree except in the lower byte of ECX, which must match
> +   the query ECX value, are merged.  The latter is needed to compress ranges
> +   on CPUs which preserve the lowest byte in ECX if an unknown leaf is
> +   queried.  */
> +struct cpuid_query
> +{
> +  unsigned int eax;
> +  unsigned ecx_first;
> +  unsigned ecx_last;
> +  bool ecx_preserves_query_byte;
> +};
> +
> +/* Single integer value that can be used for sorting/ordering
> +   comparisons.  Uses Q->eax and Q->ecx_first only because ecx_last is
> +   always greater than the previous ecx_first value and less than the
> +   subsequent one.  */
> +static inline unsigned long long int
> +cpuid_query_combined (struct cpuid_query *q)
> +{
> +  /* ecx can be -1 (that is, ~0U).  If this happens, this is the only ecx
> +     value for this eax value, so the ordering does not matter.  */
> +  return ((unsigned long long int) q->eax << 32) | (unsigned int) q->ecx_first;
> +};
> +
> +/* Used for differential reporting of zero/non-zero values.  */
> +static const struct cpuid_registers cpuid_registers_zero;
> +
> +/* Register arguments to CPUID paired with the results that came back.  */
> +struct cpuid_query_result
> +{
> +  struct cpuid_query q;
> +  struct cpuid_registers r;
> +};
> +
> +/* During a first enumeration pass, we try to collect data for
> +   cpuid_initial_subleaf_limit subleaves per leaf/EAX value.  If we
> +   run out of space, we try once more, applying the lower limit.  */
> +enum { cpuid_main_leaf_limit = 128 };
> +enum { cpuid_initial_subleaf_limit = 512 };
> +enum { cpuid_subleaf_limit = 32 };
> +
> +/* Offset of the extended leaf area.  */
> +enum { cpuid_extended_leaf_offset = 0x80000000 };
> +
> +/* Collected CPUID data.  Everything is stored in a statically sized
> +   array that is sized so that the second pass will collect some data
> +   for all leaves, after the limit is applied.  On the second pass,
> +   ecx_limit is set to cpuid_subleaf_limit.  */
> +struct cpuid_collected_data
> +{
> +  unsigned int used;
> +  unsigned int ecx_limit;
> +  uint64_t xgetbv_ecx_0;
> +  struct cpuid_query_result qr[cpuid_main_leaf_limit
> +                               * 2 * cpuid_subleaf_limit];
> +};
> +
> +/* Fill in the result of a CPUID query.  Returns true if there is
> +   room, false if nothing could be stored.  */
> +static bool
> +_dl_diagnostics_cpuid_store (struct cpuid_collected_data *ccd,
> +                             unsigned eax, int ecx)
> +{
> +  if (ccd->used >= array_length (ccd->qr))
> +    return false;
> +
> +  /* Tentatively fill in the next value.  */
> +  __cpuid_count (eax, ecx,
> +                 ccd->qr[ccd->used].r.eax,
> +                 ccd->qr[ccd->used].r.ebx,
> +                 ccd->qr[ccd->used].r.ecx,
> +                 ccd->qr[ccd->used].r.edx);
> +
> +  /* If the ECX subleaf is next subleaf after the previous one (for
> +     the same leaf), and the values are the same, merge the result
> +     with the already-stored one.  Do this before skipping zero
> +     leaves, which avoids artifacts for ECX == 256 queries.  */
> +  if (ccd->used > 0
> +      && ccd->qr[ccd->used - 1].q.eax == eax
> +      && ccd->qr[ccd->used - 1].q.ecx_last + 1 == ecx)
> +    {
> +      /* Exact match of the previous result. Ignore the value of
> +         ecx_preserves_query_byte if this is a singleton range so far
> +         because we can treat ECX as fixed if the same value repeats.  */
> +      if ((!ccd->qr[ccd->used - 1].q.ecx_preserves_query_byte
> +           || (ccd->qr[ccd->used - 1].q.ecx_first
> +               == ccd->qr[ccd->used - 1].q.ecx_last))
> +          && memcmp (&ccd->qr[ccd->used - 1].r, &ccd->qr[ccd->used].r,
> +                     sizeof (ccd->qr[ccd->used].r)) == 0)
> +        {
> +          ccd->qr[ccd->used - 1].q.ecx_last = ecx;
> +          /* ECX is now fixed because the same value has been observed
> +             twice, even if we had a low-byte match before.  */
> +          ccd->qr[ccd->used - 1].q.ecx_preserves_query_byte = false;
> +          return true;
> +        }
> +      /* Match except for the low byte in ECX, which must match the
> +         incoming ECX value.  */
> +      if (ccd->qr[ccd->used - 1].q.ecx_preserves_query_byte
> +          && (ecx & 0xff) == (ccd->qr[ccd->used].r.ecx & 0xff)
> +          && ccd->qr[ccd->used].r.eax == ccd->qr[ccd->used - 1].r.eax
> +          && ccd->qr[ccd->used].r.ebx == ccd->qr[ccd->used - 1].r.ebx
> +          && ((ccd->qr[ccd->used].r.ecx & 0xffffff00)
> +              == (ccd->qr[ccd->used - 1].r.ecx & 0xffffff00))
> +          && ccd->qr[ccd->used].r.edx == ccd->qr[ccd->used - 1].r.edx)
> +        {
> +          ccd->qr[ccd->used - 1].q.ecx_last = ecx;
> +          return true;
> +        }
> +    }
> +
> +  /* Do not store zero results.  All-zero values usually mean that the
> +     subleaf is unsupported.  */
> +  if (ccd->qr[ccd->used].r.eax == 0
> +      && ccd->qr[ccd->used].r.ebx == 0
> +      && ccd->qr[ccd->used].r.ecx == 0
> +      && ccd->qr[ccd->used].r.edx == 0)
> +    return true;
> +
> +  /* The result needs to be stored.  Fill in the query parameters and
> +     consume the storage.  */
> +  ccd->qr[ccd->used].q.eax = eax;
> +  ccd->qr[ccd->used].q.ecx_first = ecx;
> +  ccd->qr[ccd->used].q.ecx_last = ecx;
> +  ccd->qr[ccd->used].q.ecx_preserves_query_byte
> +    = (ecx & 0xff) == (ccd->qr[ccd->used].r.ecx & 0xff);
> +  ++ccd->used;
> +  return true;
> +}
> +
> +/* Collect CPUID data into *CCD.  If LIMIT, apply per-leaf limits to
> +   avoid exceeding the pre-allocated space.  Return true if all data
> +   could be stored, false if a retry with the limit applied is
> +   requested.  */
> +static bool
> +_dl_diagnostics_cpuid_collect_1 (struct cpuid_collected_data *ccd, bool limit)
> +{
> +  ccd->used = 0;
> +  ccd->ecx_limit
> +    = (limit ? cpuid_subleaf_limit : cpuid_initial_subleaf_limit) - 1;
> +  _dl_diagnostics_cpuid_store (ccd, 0x00, 0x00);
> +  if (ccd->used == 0)
> +    /* CPUID reported all 0.  Should not happen.  */
> +    return true;
> +  unsigned int maximum_leaf = ccd->qr[0x00].r.eax;
> +  if (limit && maximum_leaf >= cpuid_main_leaf_limit)
> +    maximum_leaf = cpuid_main_leaf_limit - 1;
> +
> +  for (unsigned int eax = 1; eax <= maximum_leaf; ++eax)
> +    {
> +      for (unsigned int ecx = 0; ecx <= ccd->ecx_limit; ++ecx)
> +        if (!_dl_diagnostics_cpuid_store (ccd, eax, ecx))
> +          return false;
> +    }
> +
> +  if (!_dl_diagnostics_cpuid_store (ccd, cpuid_extended_leaf_offset, 0x00))
> +    return false;
> +  maximum_leaf = ccd->qr[ccd->used - 1].r.eax;
> +  if (maximum_leaf < cpuid_extended_leaf_offset)
> +    /* No extended CPUID information.  */
> +    return true;
> +  if (limit
> +      && maximum_leaf - cpuid_extended_leaf_offset >= cpuid_main_leaf_limit)
> +    maximum_leaf = cpuid_extended_leaf_offset + cpuid_main_leaf_limit - 1;
> +  for (unsigned int eax = cpuid_extended_leaf_offset + 1;
> +       eax <= maximum_leaf; ++eax)
> +    {
> +      for (unsigned int ecx = 0; ecx <= ccd->ecx_limit; ++ecx)
> +        if (!_dl_diagnostics_cpuid_store (ccd, eax, ecx))
> +          return false;
> +    }
> +  return true;
> +}
> +
> +/* Call _dl_diagnostics_cpuid_collect_1 twice if necessary, the
> +   second time with the limit applied.  */
> +static void
> +_dl_diagnostics_cpuid_collect (struct cpuid_collected_data *ccd)
> +{
> +  if (!_dl_diagnostics_cpuid_collect_1 (ccd, false))
> +    _dl_diagnostics_cpuid_collect_1 (ccd, true);
> +
> +  /* Re-use the result of the official feature probing here.  */
> +  const struct cpu_features *cpu_features = __get_cpu_features ();
> +  if (CPU_FEATURES_CPU_P (cpu_features, OSXSAVE))
> +    {
> +      unsigned int xcrlow;
> +      unsigned int xcrhigh;
> +      asm ("xgetbv" : "=a" (xcrlow), "=d" (xcrhigh) : "c" (0));
> +      ccd->xgetbv_ecx_0 = ((uint64_t) xcrhigh << 32) + xcrlow;
> +    }
> +  else
> +    ccd->xgetbv_ecx_0 = 0;
> +}
> +
> +/* Print a CPUID register value (passed as REG_VALUE) if it differs
> +   from the expected REG_REFERENCE value.  PROCESSOR_INDEX is the
> +   process sequence number (always starting at zero; not a kernel ID).  */
> +static void
> +_dl_diagnostics_cpuid_print_reg (unsigned int processor_index,
> +                                 const struct cpuid_query *q,
> +                                 const char *reg_label, unsigned int reg_value,
> +                                 bool subleaf)
> +{
> +  if (subleaf)
> +    _dl_printf ("x86.processor[0x%x].cpuid.subleaf_eax[0x%x]"
> +                ".ecx[0x%x].%s=0x%x\n",
> +                processor_index, q->eax, q->ecx_first, reg_label, reg_value);
> +  else
> +    _dl_printf ("x86.processor[0x%x].cpuid.eax[0x%x].%s=0x%x\n",
> +                processor_index, q->eax, reg_label, reg_value);
> +}
> +
> +/* Print CPUID result values in *RESULT for the query in
> +   CCD->qr[CCD_IDX].  PROCESSOR_INDEX is the process sequence number
> +   (always starting at zero; not a kernel ID).  */
> +static void
> +_dl_diagnostics_cpuid_print_query (unsigned int processor_index,
> +                                   struct cpuid_collected_data *ccd,
> +                                   unsigned int ccd_idx,
> +                                   const struct cpuid_registers *result)
> +{
> +  /* Treat this as a value with subleaves if ecx isn't zero (maybe
> +     within the [ecx_first, ecx_last] range), or if eax matches its
> +     neighbors.  If the range is [0, ecx_limit], then the subleaves
> +     are not distinct (independently of ecx_preserves_query_byte),
> +     so do not report them separately.  */
> +  struct cpuid_query *q = &ccd->qr[ccd_idx].q;
> +  bool subleaf = (q->ecx_first > 0
> +                  || (q->ecx_first != q->ecx_last
> +                      && !(q->ecx_first == 0 && q->ecx_last == ccd->ecx_limit))
> +                  || (ccd_idx > 0 && q->eax == ccd->qr[ccd_idx - 1].q.eax)
> +                  || (ccd_idx + 1 < ccd->used
> +                      && q->eax == ccd->qr[ccd_idx + 1].q.eax));
> +  _dl_diagnostics_cpuid_print_reg (processor_index, q, "eax", result->eax,
> +                                   subleaf);
> +  _dl_diagnostics_cpuid_print_reg (processor_index, q, "ebx", result->ebx,
> +                                   subleaf);
> +  _dl_diagnostics_cpuid_print_reg (processor_index, q, "ecx", result->ecx,
> +                                   subleaf);
> +  _dl_diagnostics_cpuid_print_reg (processor_index, q, "edx", result->edx,
> +                                   subleaf);
> +
> +  if (subleaf && q->ecx_first != q->ecx_last)
> +    {
> +      _dl_printf ("x86.processor[0x%x].cpuid.subleaf_eax[0x%x]"
> +                  ".ecx[0x%x].until_ecx=0x%x\n",
> +                  processor_index, q->eax, q->ecx_first, q->ecx_last);
> +      if (q->ecx_preserves_query_byte)
> +        _dl_printf ("x86.processor[0x%x].cpuid.subleaf_eax[0x%x]"
> +                    ".ecx[0x%x].ecx_query_mask=0xff\n",
> +                    processor_index, q->eax, q->ecx_first);
> +    }
> +}
> +
> +/* Perform differential reporting of the data in *CURRENT against
> +   *BASE.  REQUESTED_CPU is the kernel CPU ID the thread was
> +   configured to run on, or -1 if no configuration was possible.
> +   PROCESSOR_INDEX is the process sequence number (always starting at
> +   zero; not a kernel ID).  */
> +static void
> +_dl_diagnostics_cpuid_report (unsigned int processor_index, int requested_cpu,
> +                              struct cpuid_collected_data *current,
> +                              struct cpuid_collected_data *base)
> +{
> +  if (requested_cpu >= 0)
> +    _dl_printf ("x86.processor[0x%x].requested=0x%x\n",
> +                processor_index, requested_cpu);
> +
> +  /* Despite CPU pinning, the requested CPU number may be different
> +     from the one we are running on.  Some container hosts behave this
> +     way.  */
> +  {
> +    unsigned int cpu_number;
> +    unsigned int node_number;
> +    if (INTERNAL_SYSCALL_CALL (getcpu, &cpu_number, &node_number) >= 0)
> +      {
> +        if (cpu_number != requested_cpu)
> +          _dl_printf ("x86.processor[0x%x].observed=0x%x\n",
> +                      processor_index, cpu_number);
> +        _dl_printf ("x86.processor[0x%x].observed_node=0x%x\n",
> +                    processor_index, node_number);
> +      }
> +  }
> +
> +  _dl_printf ("x86.processor[0x%x].cpuid_leaves=0x%x\n",
> +              processor_index, current->used);
> +  _dl_printf ("x86.processor[0x%x].ecx_limit=0x%x\n",
> +              processor_index, current->ecx_limit);
> +
> +  unsigned int base_idx = 0;
> +  for (unsigned int current_idx = 0; current_idx < current->used;
> +       ++current_idx)
> +    {
> +      /* Report missing data on the current CPU as 0.  */
> +      unsigned long long int current_query
> +        = cpuid_query_combined (&current->qr[current_idx].q);
> +      while (base_idx < base->used
> +             && cpuid_query_combined (&base->qr[base_idx].q) < current_query)
> +      {
> +        _dl_diagnostics_cpuid_print_query (processor_index, base, base_idx,
> +                                           &cpuid_registers_zero);
> +        ++base_idx;
> +      }
> +
> +      if (base_idx < base->used
> +          && cpuid_query_combined (&base->qr[base_idx].q) == current_query)
> +        {
> +          _Static_assert (sizeof (struct cpuid_registers) == 4 * 4,
> +                          "no padding in struct cpuid_registers");
> +          if (current->qr[current_idx].q.ecx_last
> +              != base->qr[base_idx].q.ecx_last
> +              || memcmp (&current->qr[current_idx].r,
> +                         &base->qr[base_idx].r,
> +                         sizeof (struct cpuid_registers)) != 0)
> +              /* The ECX range or the values have changed.  Show the
> +                 new values.  */
> +            _dl_diagnostics_cpuid_print_query (processor_index,
> +                                               current, current_idx,
> +                                               &current->qr[current_idx].r);
> +          ++base_idx;
> +        }
> +      else
> +        /* Data is absent in the base reference.  Report the new data.  */
> +        _dl_diagnostics_cpuid_print_query (processor_index,
> +                                           current, current_idx,
> +                                           &current->qr[current_idx].r);
> +    }
> +
> +  if (current->xgetbv_ecx_0 != base->xgetbv_ecx_0)
> +    {
> +      /* Re-use the 64-bit printing routine.  */
> +      _dl_printf ("x86.processor[0x%x].", processor_index);
> +      _dl_diagnostics_print_labeled_value ("xgetbv.ecx[0x0]",
> +                                           current->xgetbv_ecx_0);
> +    }
> +}
> +
> +void
> +_dl_diagnostics_cpu_kernel (void)
> +{
> +#if !HAS_CPUID
> +  /* CPUID is not supported, so there is nothing to dump.  */
> +  if (__get_cpuid_max (0, 0) == 0)
> +    return;
> +#endif
> +
> +  /* The number of processors reported so far.  Note that this is a
> +     count, not a kernel CPU number.  */
> +  unsigned int processor_index = 0;
> +
> +  /* Two copies of the data are used.  Data is written to the index
> +     (processor_index & 1).  The previous version against which the
> +     data dump is reported is at index !(processor_index & 1).  */
> +  struct cpuid_collected_data ccd[2];
> +
> +  /* The initial data is presumed to be all zero.  Zero results are
> +     not recorded.  */
> +  ccd[1].used = 0;
> +  ccd[1].xgetbv_ecx_0 = 0;
> +
> +  /* Run the CPUID probing on a specific CPU.  There are expected
> +     differences for encoding core IDs and topology information in
> +     CPUID output, but some firmware/kernel bugs also may result in
> +     asymmetric data across CPUs in some cases.
> +
> +     The CPU mask arrays are large enough for 4096 or 8192 CPUs, which
> +     should give ample space for future expansion.  */
> +  unsigned long int mask_reference[1024];
> +  int length_reference
> +    = INTERNAL_SYSCALL_CALL (sched_getaffinity, 0,
> +                             sizeof (mask_reference), mask_reference);
> +
> +  /* A parallel bit mask that is used below to request running on a
> +     specific CPU.  */
> +  unsigned long int mask_request[array_length (mask_reference)];
> +
> +  if (length_reference >= sizeof (long))
> +    {
> +      /* The kernel is supposed to return a multiple of the word size.  */
> +      length_reference /= sizeof (long);
> +
> +      for (unsigned int i = 0; i < length_reference; ++i)
> +        {
> +          /* Iterate over the bits in mask_request[i] and process
> +             those that are set; j is the bit index, bitmask is the
> +             derived mask for the bit at this index.  */
> +          unsigned int j = 0;
> +          for (unsigned long int bitmask = 1; bitmask != 0; bitmask <<= 1, ++j)
> +            {
> +              mask_request[i] = mask_reference[i] & bitmask;
> +              if (mask_request[i])
> +                {
> +                  unsigned int mask_array_length
> +                    = (i + 1) * sizeof (unsigned long int);
> +                  if (INTERNAL_SYSCALL_CALL (sched_setaffinity, 0,
> +                                             mask_array_length,
> +                                             mask_request) == 0)
> +                    {
> +                      /* This is the CPU ID number used by the
> +                         kernel.  It should match the first result
> +                         from getcpu.  */
> +                      int requested_cpu = i * ULONG_WIDTH + j;
> +                      _dl_diagnostics_cpuid_collect
> +                        (&ccd[processor_index & 1]);
> +                      _dl_diagnostics_cpuid_report
> +                        (processor_index, requested_cpu,
> +                         &ccd[processor_index & 1],
> +                         &ccd[!(processor_index & 1)]);
> +                      ++processor_index;
> +                    }
> +                }
> +            }
> +          /* Reset the mask word, so that the mask has always
> +             population count one.  */
> +          mask_request[i] = 0;
> +        }
> +    }
> +
> +  /* Fallback if we could not deliberately select a CPU.  */
> +  if (processor_index == 0)
> +    {
> +      _dl_diagnostics_cpuid_collect (&ccd[0]);
> +      _dl_diagnostics_cpuid_report (processor_index, -1, &ccd[0], &ccd[1]);
> +    }
> +}
> --
> 2.41.0
>
  
Florian Weimer Sept. 11, 2023, 4:24 a.m. UTC | #2
* Noah Goldstein:

> On Fri, Sep 8, 2023 at 3:10 PM Florian Weimer <fweimer@redhat.com> wrote:
>>
>> This is surprisingly difficult to implement if the goal is to produce
>> reasonably sized output.  With the current approaches to output
>> compression (suppressing zeros and repeated results between CPUs,
>> folding ranges of identical subleaves, dealing with the %ecx
>> reflection issue), the output is less than 600 KiB even for systems
>> with 256 threads.
>>
> Maybe it should just output a complete JSON document?

JSON cannot directly represent 64-bit integers, so it would need some
custom transformation for other parts of the --list-diagnostics output.

> Then users can pretty easily write scripts to extract the exact information
> they are after. Or the dumper can be extended in the future to let
> the user specify fields/values to dump so it can be configured to be more
> reasonable?

I'm not sure what is unreasonable about the current implementation?  I
complained about how hard it is getting the data and distilling it into
something that is not a gigantic data blob.

To be clear, with only trivial zero-value suppression, brute-force
enumeration (cutting off at 512 subleaves) results in roughly 8 KiB of
raw data per *CPU*.  It's even larger for recent CPUs which have more of
the funny ECX behavior (where unsupported subleaves do not come back as
zero).

Thanks,
Florian
  
Adhemerval Zanella Netto Sept. 11, 2023, 4:08 p.m. UTC | #3
On 08/09/23 17:10, Florian Weimer wrote:
> This is surprisingly difficult to implement if the goal is to produce
> reasonably sized output.  With the current approaches to output
> compression (suppressing zeros and repeated results between CPUs,
> folding ranges of identical subleaves, dealing with the %ecx
> reflection issue), the output is less than 600 KiB even for systems
> with 256 threads.
> 
> Tested on i686-linux-gnu and x86_64-linux-gnu.  Built with a fairly
> broad build-many-glibcs.py subset (including both Hurd targets).
> 
> ---
>  manual/dynlink.texi                           |  86 +++-
>  .../linux/x86/dl-diagnostics-cpu-kernel.c     | 457 ++++++++++++++++++
>  2 files changed, 542 insertions(+), 1 deletion(-)
>  create mode 100644 sysdeps/unix/sysv/linux/x86/dl-diagnostics-cpu-kernel.c
> 
> diff --git a/manual/dynlink.texi b/manual/dynlink.texi
> index 06a6c15533..1f02124722 100644
> --- a/manual/dynlink.texi
> +++ b/manual/dynlink.texi
> @@ -228,7 +228,91 @@ reported by the @code{uname} function.  @xref{Platform Type}.
>  @item x86.cpu_features.@dots{}
>  These items are specific to the i386 and x86-64 architectures.  They
>  reflect supported CPU features and information on cache geometry, mostly
> -collected using the @code{CPUID} instruction.
> +collected using the CPUID instruction.
> +
> +@item x86.processor[@var{index}].@dots{}
> +These are additional items for the i386 and x86-64 architectures, as
> +described below.  They mostly contain raw data from the CPUID
> +instruction.  The probes are performed for each active CPU for the
> +@code{ld.so} process, and data for different probed CPUs receives a
> > +unique @var{index} value.  Some CPUID data is expected to differ from CPU
> +core to CPU core.  In some cases, CPUs are not correctly initialized and
> +indicate the presence of different feature sets.
> +
> +@item x86.processor[@var{index}].requested=@var{kernel-cpu}
> +The kernel is told to run the subsequent probing on the CPU numbered
> +@var{kernel-cpu}.  The values @var{kernel-cpu} and @var{index} can be
> +distinct if there are gaps in the process CPU affinity mask.  This line
> +is not included if CPU affinity mask information is not available.
> +
> +@item x86.processor[@var{index}].observed=@var{kernel-cpu}
> +This line reports the kernel CPU number @var{kernel-cpu} on which the
> +probing code initially ran.  This line is only printed if the requested
> +and observed kernel CPU numbers differ.  This can happen if the kernel
> +fails to act on a request to change the process CPU affinity mask.
> +
> +@item x86.processor[@var{index}].observed_node=@var{node}
> +This reports the observed NUMA node number, as reported by the
> +@code{getcpu} system call.  It is missing if the @code{getcpu} system
> +call failed.
> +
> +@item x86.processor[@var{index}].cpuid_leaves=@var{count}
> +This line indicates that @var{count} distinct CPUID leaves were
> +encountered.  (This reflects internal @code{ld.so} storage space; it
> +does not directly correspond to CPUID enumeration ranges.)
> +
> +@item x86.processor[@var{index}].ecx_limit=@var{value}
> +The CPUID data extraction code uses a brute-force approach to enumerate
> +subleaves (see the @samp{.subleaf_eax} lines below).  The last
> +@code{%rcx} value used in a CPUID query on this probed CPU was
> +@var{value}.
> +
> +@item x86.processor[@var{index}].cpuid.eax[@var{query_eax}].eax=@var{eax}
> +@itemx x86.processor[@var{index}].cpuid.eax[@var{query_eax}].ebx=@var{ebx}
> +@itemx x86.processor[@var{index}].cpuid.eax[@var{query_eax}].ecx=@var{ecx}
> +@itemx x86.processor[@var{index}].cpuid.eax[@var{query_eax}].edx=@var{edx}
> +These lines report the register contents after executing the CPUID
> +instruction with @samp{%rax == @var{query_eax}} and @samp{%rcx == 0} (a
> +@dfn{leaf}).  For the first probed CPU (with a zero @var{index}), only
> +leaves with non-zero register contents are reported.  For subsequent
> +CPUs, only leaves whose register contents differ from those of the
> +previously probed CPU (with @var{index} one less) are reported.
> +
> +Basic and extended leaves are reported using the same syntax.  This
> +means there is a large jump in @var{query_eax} for the first reported
> +extended leaf.
> +
> +@item x86.processor[@var{index}].cpuid.subleaf_eax[@var{query_eax}].ecx[@var{query_ecx}].eax=@var{eax}
> +@itemx x86.processor[@var{index}].cpuid.subleaf_eax[@var{query_eax}].ecx[@var{query_ecx}].ebx=@var{ebx}
> +@itemx x86.processor[@var{index}].cpuid.subleaf_eax[@var{query_eax}].ecx[@var{query_ecx}].ecx=@var{ecx}
> +@itemx x86.processor[@var{index}].cpuid.subleaf_eax[@var{query_eax}].ecx[@var{query_ecx}].edx=@var{edx}
> +This is similar to the leaves above, but for a @dfn{subleaf}.  For
> +subleaves, the CPUID instruction is executed with @samp{%rax ==
> +@var{query_eax}} and @samp{%rcx == @var{query_ecx}}, so the result
> +depends on both register values.  The same rules about filtering zero
> +and identical results apply.
> +
> +@item x86.processor[@var{index}].cpuid.subleaf_eax[@var{query_eax}].ecx[@var{query_ecx}].until_ecx=@var{ecx_limit}
> +Some CPUID results are the same regardless of the @var{query_ecx} value.
> +If this situation is detected, a line with the @samp{.until_ecx}
> +selector is included, indicating that the CPUID register
> +contents are the same for @code{%rcx} values between @var{query_ecx}
> +and @var{ecx_limit} (inclusive).
> +
> +@item x86.processor[@var{index}].cpuid.subleaf_eax[@var{query_eax}].ecx[@var{query_ecx}].ecx_query_mask=0xff
> +This line indicates that in an @samp{.until_ecx} range, the CPUID
> +instruction preserved the lowest 8 bits of the input @code{%rcx} in
> +the output @code{%rcx} register.  Otherwise, the subleaves in the range
> +have identical values.  This special treatment is necessary to report
> +compact range information in case such copying occurs (because the
> +subleaves would otherwise be all different).
> +
> +@item x86.processor[@var{index}].xgetbv.ecx[@var{query_ecx}]=@var{result}
> +This line shows the 64-bit @var{result} value in the @code{%rdx:%rax}
> +register pair after executing the XGETBV instruction with @code{%rcx}
> +set to @var{query_ecx}.  Zero values and values matching the previously
> +probed CPU are omitted.  Nothing is printed if the system does not
> +support the XGETBV instruction.
>  @end table
>  
>  @node Dynamic Linker Introspection
> diff --git a/sysdeps/unix/sysv/linux/x86/dl-diagnostics-cpu-kernel.c b/sysdeps/unix/sysv/linux/x86/dl-diagnostics-cpu-kernel.c
> new file mode 100644
> index 0000000000..f84331b33b
> --- /dev/null
> +++ b/sysdeps/unix/sysv/linux/x86/dl-diagnostics-cpu-kernel.c
> @@ -0,0 +1,457 @@
> +/* Print CPU/kernel diagnostics data in ld.so.  Version for x86.
> +   Copyright (C) 2023 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <https://www.gnu.org/licenses/>.  */
> +
> +#include <dl-diagnostics.h>
> +
> +#include <array_length.h>
> +#include <cpu-features.h>
> +#include <cpuid.h>
> +#include <ldsodefs.h>
> +#include <stdbool.h>
> +#include <string.h>
> +#include <sysdep.h>
> +
> +/* Register arguments to CPUID.  Multiple ECX subleaf values yielding
> +   the same result are combined, to shorten the output.  Two kinds of
> +   matches are merged: identical matches (EAX through EDX are the
> +   same), and matches where EAX, EBX, and EDX are equal and ECX is
> +   equal except in the lower byte, which must match the query ECX
> +   value.  The latter is needed to compress ranges on CPUs which
> +   preserve the lowest byte in ECX if an unknown leaf is queried.  */
> +struct cpuid_query
> +{
> +  unsigned int eax;
> +  unsigned ecx_first;
> +  unsigned ecx_last;
> +  bool ecx_preserves_query_byte;
> +};
> +
> +/* Single integer value that can be used for sorting/ordering
> +   comparisons.  Uses Q->eax and Q->ecx_first only because ecx_last is
> +   always greater than the previous ecx_first value and less than the
> +   subsequent one.  */
> +static inline unsigned long long int
> +cpuid_query_combined (struct cpuid_query *q)
> +{
> +  /* ecx can be -1 (that is, ~0U).  If this happens, this is the only
> +     ecx value for this eax value, so the ordering does not matter.  */
> +  return ((unsigned long long int) q->eax << 32) | (unsigned int) q->ecx_first;
> +};
> +
> +/* Used for differential reporting of zero/non-zero values.  */
> +static const struct cpuid_registers cpuid_registers_zero;
> +
> +/* Register arguments to CPUID paired with the results that came back.  */
> +struct cpuid_query_result
> +{
> +  struct cpuid_query q;
> +  struct cpuid_registers r;
> +};
> +
> +/* During a first enumeration pass, we try to collect data for
> +   cpuid_initial_subleaf_limit subleaves per leaf/EAX value.  If we run
> +   out of space, we try once more, applying the lower limit.  */
> +enum { cpuid_main_leaf_limit = 128 };
> +enum { cpuid_initial_subleaf_limit = 512 };
> +enum { cpuid_subleaf_limit = 32 };
> +
> +/* Offset of the extended leaf area.  */
> +enum { cpuid_extended_leaf_offset = 0x80000000 };
> +
> +/* Collected CPUID data.  Everything is stored in a statically sized
> +   array that is sized so that the second pass will collect some data
> +   for all leaves, after the limit is applied.  On the second pass,
> +   ecx_limit is set to cpuid_subleaf_limit.  */
> +struct cpuid_collected_data
> +{
> +  unsigned int used;
> +  unsigned int ecx_limit;
> +  uint64_t xgetbv_ecx_0;
> +  struct cpuid_query_result qr[cpuid_main_leaf_limit
> +                               * 2 * cpuid_subleaf_limit];
> +};
> +
> +/* Fill in the result of a CPUID query.  Returns true if there is
> +   room, false if nothing could be stored.  */
> +static bool
> +_dl_diagnostics_cpuid_store (struct cpuid_collected_data *ccd,
> +                             unsigned eax, int ecx)
> +{
> +  if (ccd->used >= array_length (ccd->qr))
> +    return false;
> +
> +  /* Tentatively fill in the next value.  */
> +  __cpuid_count (eax, ecx,
> +                 ccd->qr[ccd->used].r.eax,
> +                 ccd->qr[ccd->used].r.ebx,
> +                 ccd->qr[ccd->used].r.ecx,
> +                 ccd->qr[ccd->used].r.edx);
> +
> +  /* If the ECX subleaf is the next subleaf after the previous one
> +     (for the same leaf), and the values are the same, merge the result
> +     with the already-stored one.  Do this before skipping zero
> +     leaves, which avoids artifacts for ECX == 256 queries.  */
> +  if (ccd->used > 0
> +      && ccd->qr[ccd->used - 1].q.eax == eax
> +      && ccd->qr[ccd->used - 1].q.ecx_last + 1 == ecx)
> +    {
> +      /* Exact match of the previous result.  Ignore the value of
> +         ecx_preserves_query_byte if this is a singleton range so far
> +         because we can treat ECX as fixed if the same value repeats.  */
> +      if ((!ccd->qr[ccd->used - 1].q.ecx_preserves_query_byte
> +           || (ccd->qr[ccd->used - 1].q.ecx_first
> +               == ccd->qr[ccd->used - 1].q.ecx_last))
> +          && memcmp (&ccd->qr[ccd->used - 1].r, &ccd->qr[ccd->used].r,
> +                     sizeof (ccd->qr[ccd->used].r)) == 0)
> +        {
> +          ccd->qr[ccd->used - 1].q.ecx_last = ecx;
> +          /* ECX is now fixed because the same value has been observed
> +             twice, even if we had a low-byte match before.  */
> +          ccd->qr[ccd->used - 1].q.ecx_preserves_query_byte = false;
> +          return true;
> +        }
> +      /* Match except for the low byte in ECX, which must match the
> +         incoming ECX value.  */
> +      if (ccd->qr[ccd->used - 1].q.ecx_preserves_query_byte
> +          && (ecx & 0xff) == (ccd->qr[ccd->used].r.ecx & 0xff)
> +          && ccd->qr[ccd->used].r.eax == ccd->qr[ccd->used - 1].r.eax
> +          && ccd->qr[ccd->used].r.ebx == ccd->qr[ccd->used - 1].r.ebx
> +          && ((ccd->qr[ccd->used].r.ecx & 0xffffff00)
> +              == (ccd->qr[ccd->used - 1].r.ecx & 0xffffff00))
> +          && ccd->qr[ccd->used].r.edx == ccd->qr[ccd->used - 1].r.edx)
> +        {
> +          ccd->qr[ccd->used - 1].q.ecx_last = ecx;
> +          return true;
> +        }
> +    }
> +
> +  /* Do not store zero results.  All-zero values usually mean that the
> +     subleaf is unsupported.  */
> +  if (ccd->qr[ccd->used].r.eax == 0
> +      && ccd->qr[ccd->used].r.ebx == 0
> +      && ccd->qr[ccd->used].r.ecx == 0
> +      && ccd->qr[ccd->used].r.edx == 0)
> +    return true;
> +
> +  /* The result needs to be stored.  Fill in the query parameters and
> +     consume the storage.  */
> +  ccd->qr[ccd->used].q.eax = eax;
> +  ccd->qr[ccd->used].q.ecx_first = ecx;
> +  ccd->qr[ccd->used].q.ecx_last = ecx;
> +  ccd->qr[ccd->used].q.ecx_preserves_query_byte
> +    = (ecx & 0xff) == (ccd->qr[ccd->used].r.ecx & 0xff);
> +  ++ccd->used;
> +  return true;
> +}
> +
> +/* Collect CPUID data into *CCD.  If LIMIT, apply per-leaf limits to
> +   avoid exceeding the pre-allocated space.  Return true if all data
> +   could be stored, false if retrying with the limit applied is
> +   requested.  */
> +static bool
> +_dl_diagnostics_cpuid_collect_1 (struct cpuid_collected_data *ccd, bool limit)
> +{
> +  ccd->used = 0;
> +  ccd->ecx_limit
> +    = (limit ? cpuid_subleaf_limit : cpuid_initial_subleaf_limit) - 1;
> +  _dl_diagnostics_cpuid_store (ccd, 0x00, 0x00);
> +  if (ccd->used == 0)
> +    /* CPUID reported all 0.  Should not happen.  */
> +    return true;
> +  unsigned int maximum_leaf = ccd->qr[0x00].r.eax;
> +  if (limit && maximum_leaf >= cpuid_main_leaf_limit)
> +    maximum_leaf = cpuid_main_leaf_limit - 1;
> +
> +  for (unsigned int eax = 1; eax <= maximum_leaf; ++eax)
> +    {
> +      for (unsigned int ecx = 0; ecx <= ccd->ecx_limit; ++ecx)
> +        if (!_dl_diagnostics_cpuid_store (ccd, eax, ecx))
> +          return false;
> +    }
> +
> +  if (!_dl_diagnostics_cpuid_store (ccd, cpuid_extended_leaf_offset, 0x00))
> +    return false;
> +  maximum_leaf = ccd->qr[ccd->used - 1].r.eax;
> +  if (maximum_leaf < cpuid_extended_leaf_offset)
> +    /* No extended CPUID information.  */
> +    return true;
> +  if (limit
> +      && maximum_leaf - cpuid_extended_leaf_offset >= cpuid_main_leaf_limit)
> +    maximum_leaf = cpuid_extended_leaf_offset + cpuid_main_leaf_limit - 1;
> +  for (unsigned int eax = cpuid_extended_leaf_offset + 1;
> +       eax <= maximum_leaf; ++eax)
> +    {
> +      for (unsigned int ecx = 0; ecx <= ccd->ecx_limit; ++ecx)
> +        if (!_dl_diagnostics_cpuid_store (ccd, eax, ecx))
> +          return false;
> +    }
> +  return true;
> +}
> +
> +/* Call _dl_diagnostics_cpuid_collect_1 twice if necessary, the
> +   second time with the limit applied.  */
> +static void
> +_dl_diagnostics_cpuid_collect (struct cpuid_collected_data *ccd)
> +{
> +  if (!_dl_diagnostics_cpuid_collect_1 (ccd, false))
> +    _dl_diagnostics_cpuid_collect_1 (ccd, true);
> +
> +  /* Re-use the result of the official feature probing here.  */
> +  const struct cpu_features *cpu_features = __get_cpu_features ();
> +  if (CPU_FEATURES_CPU_P (cpu_features, OSXSAVE))
> +    {
> +      unsigned int xcrlow;
> +      unsigned int xcrhigh;
> +      asm ("xgetbv" : "=a" (xcrlow), "=d" (xcrhigh) : "c" (0));
> +      ccd->xgetbv_ecx_0 = ((uint64_t) xcrhigh << 32) + xcrlow;
> +    }
> +  else
> +    ccd->xgetbv_ecx_0 = 0;
> +}
> +
> +/* Print a CPUID register value (passed as REG_VALUE) if it differs
> +   from the expected REG_REFERENCE value.  PROCESSOR_INDEX is the
> +   processor sequence number (always starting at zero; not a kernel ID).  */
> +static void
> +_dl_diagnostics_cpuid_print_reg (unsigned int processor_index,
> +                                 const struct cpuid_query *q,
> +                                 const char *reg_label, unsigned int reg_value,
> +                                 bool subleaf)
> +{
> +  if (subleaf)
> +    _dl_printf ("x86.processor[0x%x].cpuid.subleaf_eax[0x%x]"
> +                ".ecx[0x%x].%s=0x%x\n",
> +                processor_index, q->eax, q->ecx_first, reg_label, reg_value);
> +  else
> +    _dl_printf ("x86.processor[0x%x].cpuid.eax[0x%x].%s=0x%x\n",
> +                processor_index, q->eax, reg_label, reg_value);
> +}
> +
> +/* Print CPUID result values in *RESULT for the query in
> +   CCD->qr[CCD_IDX].  PROCESSOR_INDEX is the processor sequence number
> +   (always starting at zero; not a kernel ID).  */
> +static void
> +_dl_diagnostics_cpuid_print_query (unsigned int processor_index,
> +                                   struct cpuid_collected_data *ccd,
> +                                   unsigned int ccd_idx,
> +                                   const struct cpuid_registers *result)
> +{
> +  /* Treat this as a subleaf value if ecx isn't zero (maybe
> +     within the [ecx_first, ecx_last] range), or if eax matches its
> +     neighbors.  If the range is [0, ecx_limit], then the subleaves
> +     are not distinct (independently of ecx_preserves_query_byte),
> +     so do not report them separately.  */
> +  struct cpuid_query *q = &ccd->qr[ccd_idx].q;
> +  bool subleaf = (q->ecx_first > 0
> +                  || (q->ecx_first != q->ecx_last
> +                      && !(q->ecx_first == 0 && q->ecx_last == ccd->ecx_limit))
> +                  || (ccd_idx > 0 && q->eax == ccd->qr[ccd_idx - 1].q.eax)
> +                  || (ccd_idx + 1 < ccd->used
> +                      && q->eax == ccd->qr[ccd_idx + 1].q.eax));
> +  _dl_diagnostics_cpuid_print_reg (processor_index, q, "eax", result->eax,
> +                                   subleaf);
> +  _dl_diagnostics_cpuid_print_reg (processor_index, q, "ebx", result->ebx,
> +                                   subleaf);
> +  _dl_diagnostics_cpuid_print_reg (processor_index, q, "ecx", result->ecx,
> +                                   subleaf);
> +  _dl_diagnostics_cpuid_print_reg (processor_index, q, "edx", result->edx,
> +                                   subleaf);
> +
> +  if (subleaf && q->ecx_first != q->ecx_last)
> +    {
> +      _dl_printf ("x86.processor[0x%x].cpuid.subleaf_eax[0x%x]"
> +                  ".ecx[0x%x].until_ecx=0x%x\n",
> +                  processor_index, q->eax, q->ecx_first, q->ecx_last);
> +      if (q->ecx_preserves_query_byte)
> +        _dl_printf ("x86.processor[0x%x].cpuid.subleaf_eax[0x%x]"
> +                    ".ecx[0x%x].ecx_query_mask=0xff\n",
> +                    processor_index, q->eax, q->ecx_first);
> +    }
> +}
> +
> +/* Perform differential reporting of the data in *CURRENT against
> +   *BASE.  REQUESTED_CPU is the kernel CPU ID the thread was
> +   configured to run on, or -1 if no configuration was possible.
> +   PROCESSOR_INDEX is the processor sequence number (always starting at
> +   zero; not a kernel ID).  */
> +static void
> +_dl_diagnostics_cpuid_report (unsigned int processor_index, int requested_cpu,
> +                              struct cpuid_collected_data *current,
> +                              struct cpuid_collected_data *base)
> +{
> +  if (requested_cpu >= 0)
> +    _dl_printf ("x86.processor[0x%x].requested=0x%x\n",
> +                processor_index, requested_cpu);
> +
> +  /* Despite CPU pinning, the requested CPU number may be different
> +     from the one we are running on.  Some container hosts behave this
> +     way.  */
> +  {
> +    unsigned int cpu_number;
> +    unsigned int node_number;
> +    if (INTERNAL_SYSCALL_CALL (getcpu, &cpu_number, &node_number) >= 0)
> +      {
> +        if (cpu_number != requested_cpu)
> +          _dl_printf ("x86.processor[0x%x].observed=0x%x\n",
> +                      processor_index, cpu_number);
> +        _dl_printf ("x86.processor[0x%x].observed_node=0x%x\n",
> +                    processor_index, node_number);
> +      }
> +  }
> +
> +  _dl_printf ("x86.processor[0x%x].cpuid_leaves=0x%x\n",
> +              processor_index, current->used);
> +  _dl_printf ("x86.processor[0x%x].ecx_limit=0x%x\n",
> +              processor_index, current->ecx_limit);
> +
> +  unsigned int base_idx = 0;
> +  for (unsigned int current_idx = 0; current_idx < current->used;
> +       ++current_idx)
> +    {
> +      /* Report missing data on the current CPU as 0.  */
> +      unsigned long long int current_query
> +        = cpuid_query_combined (&current->qr[current_idx].q);
> +      while (base_idx < base->used
> +             && cpuid_query_combined (&base->qr[base_idx].q) < current_query)
> +      {
> +        _dl_diagnostics_cpuid_print_query (processor_index, base, base_idx,
> +                                           &cpuid_registers_zero);
> +        ++base_idx;
> +      }
> +
> +      if (base_idx < base->used
> +          && cpuid_query_combined (&base->qr[base_idx].q) == current_query)
> +        {
> +          _Static_assert (sizeof (struct cpuid_registers) == 4 * 4,
> +                          "no padding in struct cpuid_registers");
> +          if (current->qr[current_idx].q.ecx_last
> +              != base->qr[base_idx].q.ecx_last
> +              || memcmp (&current->qr[current_idx].r,
> +                         &base->qr[base_idx].r,
> +                         sizeof (struct cpuid_registers)) != 0)
> +              /* The ECX range or the values have changed.  Show the
> +                 new values.  */
> +            _dl_diagnostics_cpuid_print_query (processor_index,
> +                                               current, current_idx,
> +                                               &current->qr[current_idx].r);
> +          ++base_idx;
> +        }
> +      else
> +        /* Data is absent in the base reference.  Report the new data.  */
> +        _dl_diagnostics_cpuid_print_query (processor_index,
> +                                           current, current_idx,
> +                                           &current->qr[current_idx].r);
> +    }
> +
> +  if (current->xgetbv_ecx_0 != base->xgetbv_ecx_0)
> +    {
> +      /* Re-use the 64-bit printing routine.  */
> +      _dl_printf ("x86.processor[0x%x].", processor_index);
> +      _dl_diagnostics_print_labeled_value ("xgetbv.ecx[0x0]",
> +                                           current->xgetbv_ecx_0);
> +    }
> +}
> +
> +void
> +_dl_diagnostics_cpu_kernel (void)
> +{
> +#if !HAS_CPUID
> +  /* CPUID is not supported, so there is nothing to dump.  */
> +  if (__get_cpuid_max (0, 0) == 0)
> +    return;
> +#endif

I think we don't support __i486__ anymore, so we can just assume HAS_CPUID
at sysdeps/x86/include/cpu-features.h. 

> +
> +  /* The number of processors reported so far.  Note that this is a
> +     count, not a kernel CPU number.  */
> +  unsigned int processor_index = 0;
> +
> +  /* Two copies of the data are used.  Data is written to the index
> +     (processor_index & 1).  The previous version against which the
> +     data dump is reported is at index !(processor_index & 1).  */
> +  struct cpuid_collected_data ccd[2];
> +
> +  /* The initial data is presumed to be all zero.  Zero results are
> +     not recorded.  */
> +  ccd[1].used = 0;
> +  ccd[1].xgetbv_ecx_0 = 0;
> +
> +  /* Run the CPUID probing on a specific CPU.  There are expected
> +     differences for encoding core IDs and topology information in
> +     CPUID output, but some firmware/kernel bugs also may result in
> +     asymmetric data across CPUs in some cases.
> +
> +     The CPU mask arrays are large enough for 4096 or 8192 CPUs, which
> +     should give ample space for future expansion.  */
> +  unsigned long int mask_reference[1024];
> +  int length_reference
> +    = INTERNAL_SYSCALL_CALL (sched_getaffinity, 0,
> +                             sizeof (mask_reference), mask_reference);
> +
> +  /* A parallel bit mask that is used below to request running on a
> +     specific CPU.  */
> +  unsigned long int mask_request[array_length (mask_reference)];
> +
> +  if (length_reference >= sizeof (long))
> +    {
> +      /* The kernel is supposed to return a multiple of the word size.  */
> +      length_reference /= sizeof (long);
> +
> +      for (unsigned int i = 0; i < length_reference; ++i)
> +        {

Why not use the interfaces to work on cpuset? 

  if (length_reference > 0)
    {
      int cpu_count = CPU_COUNT_S (length_reference, mask_reference);
      for (int i = 0; i < cpu_count; i++)
        {
          if (CPU_ISSET_S (i, length_reference, mask_reference))
            {
              CPU_SET_S (i, length_reference, mask_request);
              if (INTERNAL_SYSCALL_CALL (sched_setaffinity, 0,
					 length_reference, mask_request) == 0)
                {
                  _dl_diagnostics_cpuid_collect (&ccd[processor_index & 1]);
                  _dl_diagnostics_cpuid_report (processor_index, i, 
						&ccd[processor_index & 1],
						&ccd[!(processor_index & 1)]);
                  ++processor_index;
                }
              CPU_CLR_S (i, length_reference, mask_request);
            }
        }
    }

It will iterate over the list twice, but I don't think this would really
matter here.

> +          /* Iterate over the bits in mask_reference[i] and process
> +             those that are set; j is the bit index, bitmask is the
> +             derived mask for the bit at this index.  */
> +          unsigned int j = 0;
> +          for (unsigned long int bitmask = 1; bitmask != 0; bitmask <<= 1, ++j)
> +            {
> +              mask_request[i] = mask_reference[i] & bitmask;
> +              if (mask_request[i])
> +                {
> +                  unsigned int mask_array_length
> +                    = (i + 1) * sizeof (unsigned long int);
> +                  if (INTERNAL_SYSCALL_CALL (sched_setaffinity, 0,
> +                                             mask_array_length,
> +                                             mask_request) == 0)
> +                    {
> +                      /* This is the CPU ID number used by the
> +                         kernel.  It should match the first result
> +                         from getcpu.  */
> +                      int requested_cpu = i * ULONG_WIDTH + j;
> +                      _dl_diagnostics_cpuid_collect
> +                        (&ccd[processor_index & 1]);
> +                      _dl_diagnostics_cpuid_report
> +                        (processor_index, requested_cpu,
> +                         &ccd[processor_index & 1],
> +                         &ccd[!(processor_index & 1)]);
> +                      ++processor_index;
> +                    }
> +                }
> +            }
> +          /* Reset the mask word, so that the mask always has
> +             population count one.  */
> +          mask_request[i] = 0;
> +        }
> +    }
> +
> +  /* Fallback if we could not deliberately select a CPU.  */
> +  if (processor_index == 0)
> +    {
> +      _dl_diagnostics_cpuid_collect (&ccd[0]);
> +      _dl_diagnostics_cpuid_report (processor_index, -1, &ccd[0], &ccd[1]);
> +    }
> +}
  
Noah Goldstein Sept. 11, 2023, 4:16 p.m. UTC | #4
On Sun, Sep 10, 2023 at 11:24 PM Florian Weimer <fweimer@redhat.com> wrote:
>
> * Noah Goldstein:
>
> > On Fri, Sep 8, 2023 at 3:10 PM Florian Weimer <fweimer@redhat.com> wrote:
> >>
> >> This is surprisingly difficult to implement if the goal is to produce
> >> reasonably sized output.  With the current approaches to output
> >> compression (suppressing zeros and repeated results between CPUs,
> >> folding ranges of identical subleaves, dealing with the %ecx
> >> reflection issue), the output is less than 600 KiB even for systems
> >> with 256 threads.
> >>
> > Maybe should just output a complete json?
>
> JSON cannot directly represent 64-bit integers, so it would need some
> custom transformation for other parts of the --list-diagnostics output.
>
> > Then users can pretty easily write scripts to extract the exact information
> > they are after. Or or the dumper can be extended in the future to let
> > the user specify fields/values to dump so it can be configured to be more
> > reasonable?
>
> I'm not sure what is unreasonable about the current implementation?  I
> complained about how hard it is getting the data and distilling it into
> something that is not a gigantic data blob.
>
> To be clear, with only trivial zero-value suppression, brute-force
> enumeration (cutting off at 512 subleaves) results in roughly 8 KiB of
> raw data per *CPU*.  It's even larger for recent CPUs which have more of
> the funny ECX behavior (where unsupported subleaves do not come back as
> zero).

Maybe I misunderstand, but the commit message is saying a 256-core system
dumps 600 KiB?

If so, that seems like a lot for a person to just grok, hence why I'd
favor a standardized format.

If JSON isn't really feasible for technical reasons, however, so be it.
>
> Thanks,
> Florian
>
  
Florian Weimer Sept. 11, 2023, 4:19 p.m. UTC | #5
* Adhemerval Zanella Netto:

>> +void
>> +_dl_diagnostics_cpu_kernel (void)
>> +{
>> +#if !HAS_CPUID
>> +  /* CPUID is not supported, so there is nothing to dump.  */
>> +  if (__get_cpuid_max (0, 0) == 0)
>> +    return;
>> +#endif
>
> I think we don't support __i486__ anymore, so we can just assume HAS_CPUID
> at sysdeps/x86/include/cpu-features.h.

We still build an i486 variant in build-many-glibcs.py.
> Why not use the interfaces to work on cpuset? 
>
>   if (length_reference > 0)
>     {
>       int cpu_count = CPU_COUNT_S (length_reference, mask_reference);
>       for (int i = 0; i < cpu_count; i++)
>         {
>           if (CPU_ISSET_S (i, length_reference, mask_reference))
>             {
>               CPU_SET_S (i, length_reference, mask_request);
>               if (INTERNAL_SYSCALL_CALL (sched_setaffinity, 0,
> 					 length_reference, mask_request) == 0)
>                 {
>                   _dl_diagnostics_cpuid_collect (&ccd[i & 1]);
>                   _dl_diagnostics_cpuid_report (processor_index, i, 
> 						&ccd[processor_index & 1],
> 						&ccd[!(processor_index & 1)]);
>                   ++processor_index;
>                 }
>               CPU_CLR_S (i, length_reference, mask_request);
>             }
>         }
>     }
>
> It will iterate over the list twice, but I don't think this would really
> matter here.

I think the macros are somewhat incompatible with direct system calls.
CPU_COUNT_S requires that the tail is zeroed, which the system call
doesn't do.  Maybe we can memset the whole thing before the
sched_getaffinity system call.  I haven't tried whether we can directly
rebuild __sched_cpucount for ld.so, either.

Thanks,
Florian
  
Florian Weimer Sept. 11, 2023, 4:25 p.m. UTC | #6
* Noah Goldstein:

> On Sun, Sep 10, 2023 at 11:24 PM Florian Weimer <fweimer@redhat.com> wrote:
>>
>> * Noah Goldstein:
>>
>> > On Fri, Sep 8, 2023 at 3:10 PM Florian Weimer <fweimer@redhat.com> wrote:
>> >>
>> >> This is surprisingly difficult to implement if the goal is to produce
>> >> reasonably sized output.  With the current approaches to output
>> >> compression (suppressing zeros and repeated results between CPUs,
>> >> folding ranges of identical subleaves, dealing with the %ecx
>> >> reflection issue), the output is less than 600 KiB even for systems
>> >> with 256 threads.
>> >>
>> > Maybe should just output a complete json?
>>
>> JSON cannot directly represent 64-bit integers, so it would need some
>> custom transformation for other parts of the --list-diagnostics output.
>>
>> > Then users can pretty easily write scripts to extract the exact information
>> > they are after. Or the dumper can be extended in the future to let
>> > the user specify fields/values to dump so it can be configured to be more
>> > reasonable?
>>
>> I'm not sure what is unreasonable about the current implementation?  I
>> complained about how hard it is getting the data and distilling it into
>> something that is not a gigantic data blob.
>>
>> To be clear, with only trivial zero-value suppression, brute-force
>> enumeration (cutting off at 512 subleaves) results in roughly 8 KiB of
>> raw data per *CPU*.  It's even larger for recent CPUs which have more of
>> the funny ECX behavior (where unsupported subleaves do not come back as
>> zero).
>
> Maybe I misunderstand but the commit message is saying a 256 core system
> dumps 600KB?

For all CPUs, after hex encoding.  So that's more like 1,000 bytes of
data per CPU (and the 8 KiB number was just an estimate, for two ECX
runaways; if you have more of those, the number grows quickly).

> If so, that seems like a lot for a person to just grok hence why I'd
> favor a standardized format.

Sure, this calls out for automated processing.  We have Python parsing
code in the testsuite, which could be repurposed.

> If JSON isn't really feasible for technical reasons, however, so be it.

There is that 53 bit problem, and we'd still have to use string keys for
the objects.

Thanks,
Florian
  
Noah Goldstein Sept. 11, 2023, 4:28 p.m. UTC | #7
On Mon, Sep 11, 2023 at 11:26 AM Florian Weimer <fweimer@redhat.com> wrote:
>
> * Noah Goldstein:
>
> > On Sun, Sep 10, 2023 at 11:24 PM Florian Weimer <fweimer@redhat.com> wrote:
> >>
> >> * Noah Goldstein:
> >>
> >> > On Fri, Sep 8, 2023 at 3:10 PM Florian Weimer <fweimer@redhat.com> wrote:
> >> >>
> >> >> This is surprisingly difficult to implement if the goal is to produce
> >> >> reasonably sized output.  With the current approaches to output
> >> >> compression (suppressing zeros and repeated results between CPUs,
> >> >> folding ranges of identical subleaves, dealing with the %ecx
> >> >> reflection issue), the output is less than 600 KiB even for systems
> >> >> with 256 threads.
> >> >>
> >> > Maybe should just output a complete json?
> >>
> >> JSON cannot directly represent 64-bit integers, so it would need some
> >> custom transformation for other parts of the --list-diagnostics output.
> >>
> >> > Then users can pretty easily write scripts to extract the exact information
> >> > they are after. Or or the dumper can be extended in the future to let
> >> > the user specify fields/values to dump so it can be configured to be more
> >> > reasonable?
> >>
> >> I'm not sure what is unreasonable about the current implementation?  I
> >> complained about how hard it is getting the data and distilling it into
> >> something that is not a gigantic data blob.
> >>
> >> To be clear, without only trivial zero-values suppression, brute-force
> >> enumeration (cutting off at 512 subleaves) results in roughly 8 KiB of
> >> raw data per *CPU*.  It's even larger for recent CPUs which have more of
> >> the funny ECX behavior (where unsupported subleaves do not come back as
> >> zero).
> >
> > Maybe I misunderstand but the commit message is saying a 256 core system
> > dumps 600KB?
>
> For all CPUs, after hex encoding.  So that's more like 1,000 bytes of
> data per CPU (and the 8 KiB number was just an estimate, for two ECX
> runaways, if you have more of those the number grows quickly).
>
> > If so, that seems like a lot for a person to just grok hence why I'd
> > favor a standardized format.
>
> Sure, this calls out for automated processing.  We have Python parsing
> code in the testsuite, which could be repurposed.
>
> > If JSON isn't really feasible for technical reasons, however, so be it.
>
> There is that 53 bit problem, and we'd still have to use string keys for
> the objects.

We can't find meaningful names for the information? If not do we even
want to dump it?

Although I'm probably trivializing the problem, a quick glance at:
https://github.com/google/cpu_features
and their json dump skips everything but feature flags.
>
> Thanks,
> Florian
>
  
Noah Goldstein Sept. 11, 2023, 4:31 p.m. UTC | #8
On Mon, Sep 11, 2023 at 11:28 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> On Mon, Sep 11, 2023 at 11:26 AM Florian Weimer <fweimer@redhat.com> wrote:
> >
> > * Noah Goldstein:
> >
> > > On Sun, Sep 10, 2023 at 11:24 PM Florian Weimer <fweimer@redhat.com> wrote:
> > >>
> > >> * Noah Goldstein:
> > >>
> > >> > On Fri, Sep 8, 2023 at 3:10 PM Florian Weimer <fweimer@redhat.com> wrote:
> > >> >>
> > >> >> This is surprisingly difficult to implement if the goal is to produce
> > >> >> reasonably sized output.  With the current approaches to output
> > >> >> compression (suppressing zeros and repeated results between CPUs,
> > >> >> folding ranges of identical subleaves, dealing with the %ecx
> > >> >> reflection issue), the output is less than 600 KiB even for systems
> > >> >> with 256 threads.
> > >> >>
> > >> > Maybe should just output a complete json?
> > >>
> > >> JSON cannot directly represent 64-bit integers, so it would need some
> > >> custom transformation for other parts of the --list-diagnostics output.
> > >>
> > >> > Then users can pretty easily write scripts to extract the exact information
> > >> > they are after. Or or the dumper can be extended in the future to let
> > >> > the user specify fields/values to dump so it can be configured to be more
> > >> > reasonable?
> > >>
> > >> I'm not sure what is unreasonable about the current implementation?  I
> > >> complained about how hard it is getting the data and distilling it into
> > >> something that is not a gigantic data blob.
> > >>
> > >> To be clear, without only trivial zero-values suppression, brute-force
> > >> enumeration (cutting off at 512 subleaves) results in roughly 8 KiB of
> > >> raw data per *CPU*.  It's even larger for recent CPUs which have more of
> > >> the funny ECX behavior (where unsupported subleaves do not come back as
> > >> zero).
> > >
> > > Maybe I misunderstand but the commit message is saying a 256 core system
> > > dumps 600KB?
> >
> > For all CPUs, after hex encoding.  So that's more like 1,000 bytes of
> > data per CPU (and the 8 KiB number was just an estimate, for two ECX
> > runaways, if you have more of those the number grows quickly).
> >
> > > If so, that seems like a lot for a person to just grok hence why I'd
> > > favor a standardized format.
> >
> > Sure, this calls out for automated processing.  We have Python parsing
> > code in the testsuite, which could be repurposed.
> >
> > > If JSON isn't really feasible for technical reasons, however, so be it.
> >
> > There is that 53 bit problem, and we'd still have to use string keys for
> > the objects.
>
> We can't find meaningful names for the information? If not do we even
> want to dump it?
>
> Although I'm probably trivializing the problem, a quick glance at:
> https://github.com/google/cpu_features
> and there json dump skips everything but feature flags.

That's not right actually; they dump itlb/cache info but don't do per-thread.
> >
> > Thanks,
> > Florian
> >
  
Adhemerval Zanella Netto Sept. 11, 2023, 4:41 p.m. UTC | #9
On 11/09/23 13:19, Florian Weimer wrote:
> * Adhemerval Zanella Netto:
> 
>>> +void
>>> +_dl_diagnostics_cpu_kernel (void)
>>> +{
>>> +#if !HAS_CPUID
>>> +  /* CPUID is not supported, so there is nothing to dump.  */
>>> +  if (__get_cpuid_max (0, 0) == 0)
>>> +    return;
>>> +#endif
>>
>> I think we don't support __i486__ anymore, so we can just assume HAS_CPUID
>> at sysdeps/x86/include/cpu-features.h.
> 
> We still build an i486 variant in build-many-glibcs.py.

Indeed.

>> Why not use the interfaces to work on cpuset? 
>>
>>   if (length_reference > 0)
>>     {
>>       int cpu_count = CPU_COUNT_S (length_reference, mask_reference);
>>       for (int i = 0; i < cpu_count; i++)
>>         {
>>           if (CPU_ISSET_S (i, length_reference, mask_reference)
>>             {
>>               CPU_SET_S (i, length_reference, mask_request);
>>               if (INTERNAL_SYSCALL_CALL (sched_setaffinity, 0,
>> 					 length_reference, mask_request) == 0)
>>                 {
>>                   _dl_diagnostics_cpuid_collect (&ccd[i & 1]);
>>                   _dl_diagnostics_cpuid_report (processor_index, i, 
>> 						&ccd[processor_index & 1],
>> 						&ccd[!(processor_index & 1)]);
>>                   ++processor_index;
>>                 }
>>               CPU_CLR_S (i, length_reference, mask_request);
>>             }
>>         }
>>     }
>>
>> I will iterate over the list twice, but I don't think this would really matter
>> here.
> 
> I think the macros are somewhat incompatible with direct system calls.
> CPU_COUNT_S requires that the tail is zeroed, which the system call
> doesn't do.  Maybe we can memset the whole thing before the sched_getcpu
> system call.  I haven't tried if we can directly rebuild
> __sched_cpucount for ld.so, either.

Afaiu the setsize is exactly to avoid the tail zeroing and to use the syscall
return code, isn't it?  It does not work outside libc because sched_getaffinity
does not return the bytes set from the kernel, but it does work with direct
syscalls:

$ cat > test.c <<EOF
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

int
main (int argc, char *argv[])
{
  size_t cpusetsize = CPU_ALLOC_SIZE (1024);
  cpu_set_t cpuset[cpusetsize];
  memset (cpuset, 0xff, cpusetsize * sizeof (cpu_set_t));

  long int sz = syscall (SYS_sched_getaffinity, getpid (), cpusetsize, cpuset);

  printf ("%ld %d\n", sz, CPU_COUNT_S (sz, cpuset));

  return 0;
}
EOF
$ gcc -Wall test.c -o test && ./test
8 24
  
Florian Weimer Sept. 11, 2023, 5:48 p.m. UTC | #10
* Noah Goldstein:

>> > If JSON isn't really feasible for technical reasons, however, so be it.
>>
>> There is that 53 bit problem, and we'd still have to use string keys for
>> the objects.
>
> We can't find meaningful names for the information? If not do we even
> want to dump it?

My goal was to write a generic dumper, not something we need update for
every new CPU generation.  That's why I didn't want to hard-code subleaf
structure and implement a brute-force approach instead.

For example, when we had the leaf 2 problems, the existing dumps were
insufficient.  Tomorrow it might be a completely different leaf.  And
the dumper intends to cover everything, including parts that are only
used by (possibly future) GCC versions in __builtin_cpu_supports.

Thanks,
Florian
  
Noah Goldstein Sept. 11, 2023, 6:35 p.m. UTC | #11
On Mon, Sep 11, 2023 at 12:48 PM Florian Weimer <fweimer@redhat.com> wrote:
>
> * Noah Goldstein:
>
> >> > If JSON isn't really feasible for technical reasons, however, so be it.
> >>
> >> There is that 53 bit problem, and we'd still have to use string keys for
> >> the objects.
> >
> > We can't find meaningful names for the information? If not do we even
> > want to dump it?
>
> My goal was to write a generic dumper, not something we need update for
> every new CPU generation.  That's why I didn't want to hard-code subleaf
> structure and implement a brute-force approach instead.
>
> For example, when we had the leaf 2 problems, the existing dumps were
> insufficient.  Tomorrow it might be a completely different leaf.  And
> the dumper intends to cover everything, including parts that are only
> used by (possibly future) GCC versions in __builtin_cpu_supports.
>
I see. I guess that makes sense.
> Thanks,
> Florian
>
  

Patch

diff --git a/manual/dynlink.texi b/manual/dynlink.texi
index 06a6c15533..1f02124722 100644
--- a/manual/dynlink.texi
+++ b/manual/dynlink.texi
@@ -228,7 +228,91 @@  reported by the @code{uname} function.  @xref{Platform Type}.
 @item x86.cpu_features.@dots{}
 These items are specific to the i386 and x86-64 architectures.  They
 reflect supported CPU features and information on cache geometry, mostly
-collected using the @code{CPUID} instruction.
+collected using the CPUID instruction.
+
+@item x86.processor[@var{index}].@dots{}
+These are additional items for the i386 and x86-64 architectures, as
+described below.  They mostly contain raw data from the CPUID
+instruction.  The probes are performed for each active CPU for the
+@code{ld.so} process, and data for different probed CPUs receives a
+unique @var{index} value.  Some CPUID data is expected to differ from CPU
+core to CPU core.  In some cases, CPUs are not correctly initialized and
+indicate the presence of different feature sets.
+
+@item x86.processor[@var{index}].requested=@var{kernel-cpu}
+The kernel is told to run the subsequent probing on the CPU numbered
+@var{kernel-cpu}.  The values @var{kernel-cpu} and @var{index} can be
+distinct if there are gaps in the process CPU affinity mask.  This line
+is not included if CPU affinity mask information is not available.
+
+@item x86.processor[@var{index}].observed=@var{kernel-cpu}
+This line reports the kernel CPU number @var{kernel-cpu} on which the
+probing code initially ran.  This line is only printed if the requested
+and observed kernel CPU numbers differ.  This can happen if the kernel
+fails to act on a request to change the process CPU affinity mask.
+
+@item x86.processor[@var{index}].observed_node=@var{node}
+This reports the observed NUMA node number, as reported by the
+@code{getcpu} system call.  It is missing if the @code{getcpu} system
+call failed.
+
+@item x86.processor[@var{index}].cpuid_leaves=@var{count}
+This line indicates that @var{count} distinct CPUID leaves were
+encountered.  (This reflects internal @code{ld.so} storage space; it
+does not directly correspond to @code{CPUID} enumeration ranges.)
+
+@item x86.processor[@var{index}].ecx_limit=@var{value}
+The CPUID data extraction code uses a brute-force approach to enumerate
+subleaves (see the @samp{.subleaf_eax} lines below).  The last
+@code{%rcx} value used in a CPUID query on this probed CPU was
+@var{value}.
+
+@item x86.processor[@var{index}].cpuid.eax[@var{query_eax}].eax=@var{eax}
+@itemx x86.processor[@var{index}].cpuid.eax[@var{query_eax}].ebx=@var{ebx}
+@itemx x86.processor[@var{index}].cpuid.eax[@var{query_eax}].ecx=@var{ecx}
+@itemx x86.processor[@var{index}].cpuid.eax[@var{query_eax}].edx=@var{edx}
+These lines report the register contents after executing the CPUID
+instruction with @samp{%rax == @var{query_eax}} and @samp{%rcx == 0} (a
+@dfn{leaf}).  For the first probed CPU (with a zero @var{index}), only
+leaves with non-zero register contents are reported.  For subsequent
+CPUs, only leaves whose register contents differ from the previously
+probed CPU (with @var{index} one less) are reported.
+
+Basic and extended leaves are reported using the same syntax.  This
+means there is a large jump in @var{query_eax} for the first reported
+extended leaf.
+
+@item x86.processor[@var{index}].cpuid.subleaf_eax[@var{query_eax}].ecx[@var{query_ecx}].eax=@var{eax}
+@itemx x86.processor[@var{index}].cpuid.subleaf_eax[@var{query_eax}].ecx[@var{query_ecx}].ebx=@var{ebx}
+@itemx x86.processor[@var{index}].cpuid.subleaf_eax[@var{query_eax}].ecx[@var{query_ecx}].ecx=@var{ecx}
+@itemx x86.processor[@var{index}].cpuid.subleaf_eax[@var{query_eax}].ecx[@var{query_ecx}].edx=@var{edx}
+This is similar to the leaves above, but for a @dfn{subleaf}.  For
+subleaves, the CPUID instruction is executed with @samp{%rax ==
+@var{query_eax}} and @samp{%rcx == @var{query_ecx}}, so the result
+depends on both register values.  The same rules about filtering zero
+and identical results apply.
+
+@item x86.processor[@var{index}].cpuid.subleaf_eax[@var{query_eax}].ecx[@var{query_ecx}].until_ecx=@var{ecx_limit}
+Some CPUID results are the same regardless of the @var{query_ecx} value.
+If this situation is detected, a line with the @samp{.until_ecx}
+selector is included, indicating that the CPUID register
+contents are the same for @code{%rcx} values between @var{query_ecx}
+and @var{ecx_limit} (inclusive).
+
+@item x86.processor[@var{index}].cpuid.subleaf_eax[@var{query_eax}].ecx[@var{query_ecx}].ecx_query_mask=0xff
+This line indicates that in an @samp{.until_ecx} range, the CPUID
+instruction preserved the lowest 8 bits of the input @code{%rcx} in
+the output @code{%rcx} register.  Otherwise, the subleaves in the range
+have identical values.  This special treatment is necessary to report
+compact range information in case such copying occurs (because the
+subleaves would otherwise be all different).
+
+@item x86.processor[@var{index}].xgetbv.ecx[@var{query_ecx}]=@var{result}
+This line shows the 64-bit @var{result} value in the @code{%rdx:%rax}
+register pair after executing the XGETBV instruction with @code{%rcx}
+set to @var{query_ecx}.  Zero values and values matching the previously
+probed CPU are omitted.  Nothing is printed if the system does not
+support the XGETBV instruction.
 @end table
 
 @node Dynamic Linker Introspection
diff --git a/sysdeps/unix/sysv/linux/x86/dl-diagnostics-cpu-kernel.c b/sysdeps/unix/sysv/linux/x86/dl-diagnostics-cpu-kernel.c
new file mode 100644
index 0000000000..f84331b33b
--- /dev/null
+++ b/sysdeps/unix/sysv/linux/x86/dl-diagnostics-cpu-kernel.c
@@ -0,0 +1,457 @@ 
+/* Print CPU/kernel diagnostics data in ld.so.  Version for x86.
+   Copyright (C) 2023 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#include <dl-diagnostics.h>
+
+#include <array_length.h>
+#include <cpu-features.h>
+#include <cpuid.h>
+#include <ldsodefs.h>
+#include <stdbool.h>
+#include <string.h>
+#include <sysdep.h>
+
+/* Register arguments to CPUID.  Multiple ECX subleaf values yielding
+   the same result are combined, to shorten the output.  Merged are
+   both identical matches (EAX to EDX are the same) and matches where
+   EAX, EBX, and EDX are equal and ECX is equal except in the lower
+   byte, which must match the query ECX value.  The latter is needed
+   to compress ranges
+   on CPUs which preserve the lowest byte in ECX if an unknown leaf is
+   queried.  */
+struct cpuid_query
+{
+  unsigned int eax;
+  unsigned ecx_first;
+  unsigned ecx_last;
+  bool ecx_preserves_query_byte;
+};
+
+/* Single integer value that can be used for sorting/ordering
+   comparisons.  Uses Q->eax and Q->ecx_first only because ecx_last is
+   always greater than the previous ecx_first value and less than the
+   subsequent one.  */
+static inline unsigned long long int
+cpuid_query_combined (struct cpuid_query *q)
+{
+  /* ecx can be -1 (that is, ~0U).  If so, this is the only ecx
+     value for this eax value, so the ordering does not matter.  */
+  return ((unsigned long long int) q->eax << 32) | (unsigned int) q->ecx_first;
+};
+
+/* Used for differential reporting of zero/non-zero values.  */
+static const struct cpuid_registers cpuid_registers_zero;
+
+/* Register arguments to CPUID paired with the results that came back.  */
+struct cpuid_query_result
+{
+  struct cpuid_query q;
+  struct cpuid_registers r;
+};
+
+/* During a first enumeration pass, we try to collect data for
+   cpuid_initial_subleaf_limit subleaves per leaf/EAX value.  If we
+   run out of space, we try once more, applying the lower limit.  */
+enum { cpuid_main_leaf_limit = 128 };
+enum { cpuid_initial_subleaf_limit = 512 };
+enum { cpuid_subleaf_limit = 32 };
+
+/* Offset of the extended leaf area.  */
+enum { cpuid_extended_leaf_offset = 0x80000000 };
+
+/* Collected CPUID data.  Everything is stored in a statically sized
+   array that is sized so that the second pass will collect some data
+   for all leaves, after the limit is applied.  On the second pass,
+   ecx_limit is set to cpuid_subleaf_limit.  */
+struct cpuid_collected_data
+{
+  unsigned int used;
+  unsigned int ecx_limit;
+  uint64_t xgetbv_ecx_0;
+  struct cpuid_query_result qr[cpuid_main_leaf_limit
+                               * 2 * cpuid_subleaf_limit];
+};
+
+/* Fill in the result of a CPUID query.  Returns true if there is
+   room, false if nothing could be stored.  */
+static bool
+_dl_diagnostics_cpuid_store (struct cpuid_collected_data *ccd,
+                             unsigned eax, int ecx)
+{
+  if (ccd->used >= array_length (ccd->qr))
+    return false;
+
+  /* Tentatively fill in the next value.  */
+  __cpuid_count (eax, ecx,
+                 ccd->qr[ccd->used].r.eax,
+                 ccd->qr[ccd->used].r.ebx,
+                 ccd->qr[ccd->used].r.ecx,
+                 ccd->qr[ccd->used].r.edx);
+
+  /* If the ECX subleaf is next subleaf after the previous one (for
+     the same leaf), and the values are the same, merge the result
+     with the already-stored one.  Do this before skipping zero
+     leaves, which avoids artifacts for ECX == 256 queries.  */
+  if (ccd->used > 0
+      && ccd->qr[ccd->used - 1].q.eax == eax
+      && ccd->qr[ccd->used - 1].q.ecx_last + 1 == ecx)
+    {
+      /* Exact match of the previous result.  Ignore the value of
+         ecx_preserves_query_byte if this is a singleton range so far
+         because we can treat ECX as fixed if the same value repeats.  */
+      if ((!ccd->qr[ccd->used - 1].q.ecx_preserves_query_byte
+           || (ccd->qr[ccd->used - 1].q.ecx_first
+               == ccd->qr[ccd->used - 1].q.ecx_last))
+          && memcmp (&ccd->qr[ccd->used - 1].r, &ccd->qr[ccd->used].r,
+                     sizeof (ccd->qr[ccd->used].r)) == 0)
+        {
+          ccd->qr[ccd->used - 1].q.ecx_last = ecx;
+          /* ECX is now fixed because the same value has been observed
+             twice, even if we had a low-byte match before.  */
+          ccd->qr[ccd->used - 1].q.ecx_preserves_query_byte = false;
+          return true;
+        }
+      /* Match except for the low byte in ECX, which must match the
+         incoming ECX value.  */
+      if (ccd->qr[ccd->used - 1].q.ecx_preserves_query_byte
+          && (ecx & 0xff) == (ccd->qr[ccd->used].r.ecx & 0xff)
+          && ccd->qr[ccd->used].r.eax == ccd->qr[ccd->used - 1].r.eax
+          && ccd->qr[ccd->used].r.ebx == ccd->qr[ccd->used - 1].r.ebx
+          && ((ccd->qr[ccd->used].r.ecx & 0xffffff00)
+              == (ccd->qr[ccd->used - 1].r.ecx & 0xffffff00))
+          && ccd->qr[ccd->used].r.edx == ccd->qr[ccd->used - 1].r.edx)
+        {
+          ccd->qr[ccd->used - 1].q.ecx_last = ecx;
+          return true;
+        }
+    }
+
+  /* Do not store zero results.  All-zero values usually mean that the
+     subleaf is unsupported.  */
+  if (ccd->qr[ccd->used].r.eax == 0
+      && ccd->qr[ccd->used].r.ebx == 0
+      && ccd->qr[ccd->used].r.ecx == 0
+      && ccd->qr[ccd->used].r.edx == 0)
+    return true;
+
+  /* The result needs to be stored.  Fill in the query parameters and
+     consume the storage.  */
+  ccd->qr[ccd->used].q.eax = eax;
+  ccd->qr[ccd->used].q.ecx_first = ecx;
+  ccd->qr[ccd->used].q.ecx_last = ecx;
+  ccd->qr[ccd->used].q.ecx_preserves_query_byte
+    = (ecx & 0xff) == (ccd->qr[ccd->used].r.ecx & 0xff);
+  ++ccd->used;
+  return true;
+}
+
+/* Collect CPUID data into *CCD.  If LIMIT, apply per-leaf limits to
+   avoid exceeding the pre-allocated space.  Return true if all data
+   could be stored, false if retrying with the limit applied is
+   requested.  */
+static bool
+_dl_diagnostics_cpuid_collect_1 (struct cpuid_collected_data *ccd, bool limit)
+{
+  ccd->used = 0;
+  ccd->ecx_limit
+    = (limit ? cpuid_subleaf_limit : cpuid_initial_subleaf_limit) - 1;
+  _dl_diagnostics_cpuid_store (ccd, 0x00, 0x00);
+  if (ccd->used == 0)
+    /* CPUID reported all 0.  Should not happen.  */
+    return true;
+  unsigned int maximum_leaf = ccd->qr[0x00].r.eax;
+  if (limit && maximum_leaf >= cpuid_main_leaf_limit)
+    maximum_leaf = cpuid_main_leaf_limit - 1;
+
+  for (unsigned int eax = 1; eax <= maximum_leaf; ++eax)
+    {
+      for (unsigned int ecx = 0; ecx <= ccd->ecx_limit; ++ecx)
+        if (!_dl_diagnostics_cpuid_store (ccd, eax, ecx))
+          return false;
+    }
+
+  if (!_dl_diagnostics_cpuid_store (ccd, cpuid_extended_leaf_offset, 0x00))
+    return false;
+  maximum_leaf = ccd->qr[ccd->used - 1].r.eax;
+  if (maximum_leaf < cpuid_extended_leaf_offset)
+    /* No extended CPUID information.  */
+    return true;
+  if (limit
+      && maximum_leaf - cpuid_extended_leaf_offset >= cpuid_main_leaf_limit)
+    maximum_leaf = cpuid_extended_leaf_offset + cpuid_main_leaf_limit - 1;
+  for (unsigned int eax = cpuid_extended_leaf_offset + 1;
+       eax <= maximum_leaf; ++eax)
+    {
+      for (unsigned int ecx = 0; ecx <= ccd->ecx_limit; ++ecx)
+        if (!_dl_diagnostics_cpuid_store (ccd, eax, ecx))
+          return false;
+    }
+  return true;
+}
+
+/* Call _dl_diagnostics_cpuid_collect_1 twice if necessary, the
+   second time with the limit applied.  */
+static void
+_dl_diagnostics_cpuid_collect (struct cpuid_collected_data *ccd)
+{
+  if (!_dl_diagnostics_cpuid_collect_1 (ccd, false))
+    _dl_diagnostics_cpuid_collect_1 (ccd, true);
+
+  /* Re-use the result of the official feature probing here.  */
+  const struct cpu_features *cpu_features = __get_cpu_features ();
+  if (CPU_FEATURES_CPU_P (cpu_features, OSXSAVE))
+    {
+      unsigned int xcrlow;
+      unsigned int xcrhigh;
+      asm ("xgetbv" : "=a" (xcrlow), "=d" (xcrhigh) : "c" (0));
+      ccd->xgetbv_ecx_0 = ((uint64_t) xcrhigh << 32) + xcrlow;
+    }
+  else
+    ccd->xgetbv_ecx_0 = 0;
+}
+
+/* Print a CPUID register value (passed as REG_VALUE) if it differs
+   from the expected REG_REFERENCE value.  PROCESSOR_INDEX is the
+   probe sequence number (always starting at zero; not a kernel ID).  */
+static void
+_dl_diagnostics_cpuid_print_reg (unsigned int processor_index,
+                                 const struct cpuid_query *q,
+                                 const char *reg_label, unsigned int reg_value,
+                                 bool subleaf)
+{
+  if (subleaf)
+    _dl_printf ("x86.processor[0x%x].cpuid.subleaf_eax[0x%x]"
+                ".ecx[0x%x].%s=0x%x\n",
+                processor_index, q->eax, q->ecx_first, reg_label, reg_value);
+  else
+    _dl_printf ("x86.processor[0x%x].cpuid.eax[0x%x].%s=0x%x\n",
+                processor_index, q->eax, reg_label, reg_value);
+}
+
+/* Print CPUID result values in *RESULT for the query in
+   CCD->qr[CCD_IDX].  PROCESSOR_INDEX is the probe sequence number
+   (always starting at zero; not a kernel ID).  */
+static void
+_dl_diagnostics_cpuid_print_query (unsigned int processor_index,
+                                   struct cpuid_collected_data *ccd,
+                                   unsigned int ccd_idx,
+                                   const struct cpuid_registers *result)
+{
+  /* Treat this as a subleaf if ecx isn't zero (maybe within the
+     [ecx_first, ecx_last] range), or if eax matches that of its
+     neighbors.  If the range is [0, ecx_limit], then the subleaves
+     are not distinct (independently of ecx_preserves_query_byte),
+     so do not report them separately.  */
+  struct cpuid_query *q = &ccd->qr[ccd_idx].q;
+  bool subleaf = (q->ecx_first > 0
+                  || (q->ecx_first != q->ecx_last
+                      && !(q->ecx_first == 0 && q->ecx_last == ccd->ecx_limit))
+                  || (ccd_idx > 0 && q->eax == ccd->qr[ccd_idx - 1].q.eax)
+                  || (ccd_idx + 1 < ccd->used
+                      && q->eax == ccd->qr[ccd_idx + 1].q.eax));
+  _dl_diagnostics_cpuid_print_reg (processor_index, q, "eax", result->eax,
+                                   subleaf);
+  _dl_diagnostics_cpuid_print_reg (processor_index, q, "ebx", result->ebx,
+                                   subleaf);
+  _dl_diagnostics_cpuid_print_reg (processor_index, q, "ecx", result->ecx,
+                                   subleaf);
+  _dl_diagnostics_cpuid_print_reg (processor_index, q, "edx", result->edx,
+                                   subleaf);
+
+  if (subleaf && q->ecx_first != q->ecx_last)
+    {
+      _dl_printf ("x86.processor[0x%x].cpuid.subleaf_eax[0x%x]"
+                  ".ecx[0x%x].until_ecx=0x%x\n",
+                  processor_index, q->eax, q->ecx_first, q->ecx_last);
+      if (q->ecx_preserves_query_byte)
+        _dl_printf ("x86.processor[0x%x].cpuid.subleaf_eax[0x%x]"
+                    ".ecx[0x%x].ecx_query_mask=0xff\n",
+                    processor_index, q->eax, q->ecx_first);
+    }
+}
+
+/* Perform differential reporting of the data in *CURRENT against
+   *BASE.  REQUESTED_CPU is the kernel CPU ID the thread was
+   configured to run on, or -1 if no configuration was possible.
+   PROCESSOR_INDEX is the probe sequence number (always starting at
+   zero; not a kernel ID).  */
+static void
+_dl_diagnostics_cpuid_report (unsigned int processor_index, int requested_cpu,
+                              struct cpuid_collected_data *current,
+                              struct cpuid_collected_data *base)
+{
+  if (requested_cpu >= 0)
+    _dl_printf ("x86.processor[0x%x].requested=0x%x\n",
+                processor_index, requested_cpu);
+
+  /* Despite CPU pinning, the requested CPU number may be different
+     from the one we are running on.  Some container hosts behave this
+     way.  */
+  {
+    unsigned int cpu_number;
+    unsigned int node_number;
+    if (INTERNAL_SYSCALL_CALL (getcpu, &cpu_number, &node_number) >= 0)
+      {
+        if (cpu_number != requested_cpu)
+          _dl_printf ("x86.processor[0x%x].observed=0x%x\n",
+                      processor_index, cpu_number);
+        _dl_printf ("x86.processor[0x%x].observed_node=0x%x\n",
+                    processor_index, node_number);
+      }
+  }
+
+  _dl_printf ("x86.processor[0x%x].cpuid_leaves=0x%x\n",
+              processor_index, current->used);
+  _dl_printf ("x86.processor[0x%x].ecx_limit=0x%x\n",
+              processor_index, current->ecx_limit);
+
+  unsigned int base_idx = 0;
+  for (unsigned int current_idx = 0; current_idx < current->used;
+       ++current_idx)
+    {
+      /* Report missing data on the current CPU as 0.  */
+      unsigned long long int current_query
+        = cpuid_query_combined (&current->qr[current_idx].q);
+      while (base_idx < base->used
+             && cpuid_query_combined (&base->qr[base_idx].q) < current_query)
+      {
+        _dl_diagnostics_cpuid_print_query (processor_index, base, base_idx,
+                                           &cpuid_registers_zero);
+        ++base_idx;
+      }
+
+      if (base_idx < base->used
+          && cpuid_query_combined (&base->qr[base_idx].q) == current_query)
+        {
+          _Static_assert (sizeof (struct cpuid_registers) == 4 * 4,
+                          "no padding in struct cpuid_registers");
+          if (current->qr[current_idx].q.ecx_last
+              != base->qr[base_idx].q.ecx_last
+              || memcmp (&current->qr[current_idx].r,
+                         &base->qr[base_idx].r,
+                         sizeof (struct cpuid_registers)) != 0)
+              /* The ECX range or the values have changed.  Show the
+                 new values.  */
+            _dl_diagnostics_cpuid_print_query (processor_index,
+                                               current, current_idx,
+                                               &current->qr[current_idx].r);
+          ++base_idx;
+        }
+      else
+        /* Data is absent in the base reference.  Report the new data.  */
+        _dl_diagnostics_cpuid_print_query (processor_index,
+                                           current, current_idx,
+                                           &current->qr[current_idx].r);
+    }
+
+  if (current->xgetbv_ecx_0 != base->xgetbv_ecx_0)
+    {
+      /* Re-use the 64-bit printing routine.  */
+      _dl_printf ("x86.processor[0x%x].", processor_index);
+      _dl_diagnostics_print_labeled_value ("xgetbv.ecx[0x0]",
+                                           current->xgetbv_ecx_0);
+    }
+}
+
+void
+_dl_diagnostics_cpu_kernel (void)
+{
+#if !HAS_CPUID
+  /* CPUID is not supported, so there is nothing to dump.  */
+  if (__get_cpuid_max (0, 0) == 0)
+    return;
+#endif
+
+  /* The number of processors reported so far.  Note that this is a count,
+     not a kernel CPU number.  */
+  unsigned int processor_index = 0;
+
+  /* Two copies of the data are used.  Data is written to the index
+     (processor_index & 1).  The previous version against which the
+     data dump is reported is at index !(processor_index & 1).  */
+  struct cpuid_collected_data ccd[2];
+
+  /* The initial data is presumed to be all zero.  Zero results are
+     not recorded.  */
+  ccd[1].used = 0;
+  ccd[1].xgetbv_ecx_0 = 0;
+
+  /* Run the CPUID probing on a specific CPU.  There are expected
+     differences for encoding core IDs and topology information in
+     CPUID output, but some firmware/kernel bugs also may result in
+     asymmetric data across CPUs in some cases.
+
+     The CPU mask arrays are large enough for 4096 or 8192 CPUs, which
+     should give ample space for future expansion.  */
+  unsigned long int mask_reference[1024];
+  int length_reference
+    = INTERNAL_SYSCALL_CALL (sched_getaffinity, 0,
+                             sizeof (mask_reference), mask_reference);
+
+  /* A parallel bit mask that is used below to request running on a
+     specific CPU.  */
+  unsigned long int mask_request[array_length (mask_reference)];
+
+  if (length_reference >= (int) sizeof (long))
+    {
+      /* The kernel is supposed to return a multiple of the word size.
+         The cast keeps the comparison signed, so that a negative
+         error return from the system call is rejected instead of
+         being treated as a huge unsigned length.  */
+      length_reference /= sizeof (long);
+
+      for (unsigned int i = 0; i < length_reference; ++i)
+        {
+          /* Iterate over the bits in mask_request[i] and process
+             those that are set; j is the bit index, bitmask is the
+             derived mask for the bit at this index.  */
+          unsigned int j = 0;
+          for (unsigned long int bitmask = 1; bitmask != 0; bitmask <<= 1, ++j)
+            {
+              mask_request[i] = mask_reference[i] & bitmask;
+              if (mask_request[i])
+                {
+                  unsigned int mask_array_length
+                    = (i + 1) * sizeof (unsigned long int);
+                  if (INTERNAL_SYSCALL_CALL (sched_setaffinity, 0,
+                                             mask_array_length,
+                                             mask_request) == 0)
+                    {
+                      /* This is the CPU ID number used by the
+                         kernel.  It should match the first result
+                         from getcpu.  */
+                      int requested_cpu = i * ULONG_WIDTH + j;
+                      _dl_diagnostics_cpuid_collect
+                        (&ccd[processor_index & 1]);
+                      _dl_diagnostics_cpuid_report
+                        (processor_index, requested_cpu,
+                         &ccd[processor_index & 1],
+                         &ccd[!(processor_index & 1)]);
+                      ++processor_index;
+                    }
+                }
+            }
+          /* Reset the mask word, so that the mask always has
+             population count one.  */
+          mask_request[i] = 0;
+        }
+    }
+
+  /* Fall back to probing the current CPU if we could not
+     deliberately select one.  */
+  if (processor_index == 0)
+    {
+      _dl_diagnostics_cpuid_collect (&ccd[0]);
+      _dl_diagnostics_cpuid_report (processor_index, -1, &ccd[0], &ccd[1]);
+    }
+}