Initial implementation of -Whomoglyph [PR preprocessor/103027]

Message ID 20211101211412.1123930-1-dmalcolm@redhat.com

Commit Message

David Malcolm Nov. 1, 2021, 9:14 p.m. UTC
  [Resending to get around mailing list size limit; see notes below]

This patch implements a new -Whomoglyph diagnostic, enabled by default.

Internally it implements the "skeleton" algorithm from:
  http://www.unicode.org/reports/tr39/#Confusable_Detection
so that every new identifier is mapped to a "skeleton"; if that
skeleton is already in use by a different identifier, a -Whomoglyph
diagnostic is issued.
It uses the data from:
  https://www.unicode.org/Public/security/13.0.0/confusables.txt
to determine which characters are confusable.
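
For illustration, the gist of the skeleton computation can be sketched in a
few lines of Python.  This is not the patch's code: unicodedata supplies the
NFD step, and the two-entry confusables map below is a toy stand-in for the
full confusables.txt data.

```python
import unicodedata

# Toy subset of confusables.txt (source char -> exemplar replacement):
CONFUSABLES = {
    '\u041d': 'H',   # CYRILLIC CAPITAL LETTER EN
    '\u0455': 's',   # CYRILLIC SMALL LETTER DZE
}

def tr39_skeleton(ident):
    """TR39 skeleton(X): NFD(X), replace each confusable char, NFD again."""
    nfd = unicodedata.normalize('NFD', ident)
    mapped = ''.join(CONFUSABLES.get(ch, ch) for ch in nfd)
    return unicodedata.normalize('NFD', mapped)

# Two different identifiers sharing a skeleton are confusable:
assert tr39_skeleton('say\u041dello') == tr39_skeleton('sayHello')
```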

For example, given the example of CVE-2021-42694 at
https://trojansource.codes/, with this patch we emit:

t.cc:7:1: warning: identifier ‘sayНello’ (‘say\u041dello’)... [CWE-1007] [-Whomoglyph]
    7 | void say<U+041D>ello() {
      | ^~~~
t.cc:3:1: note: ...confusable with non-equal identifier ‘sayHello’ here
    3 | void sayHello() {
      | ^~~~

(the precise location of the token isn't quite right; the
identifiers should be underlined, rather than the "void" tokens)

This takes advantage of:
  "diagnostics: escape non-ASCII source bytes for certain diagnostics"
    https://gcc.gnu.org/pipermail/gcc-patches/2021-November/583020.html
to escape non-ASCII characters when printing a source line for -Whomoglyph,
so that we print "say<U+041D>ello" when quoting the source line, making it
clearer that this is not "sayHello".

In order to implement "skeleton", I had to implement NFD support, so the
patch also contains some UTF-32 support code.
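
For reference, NFD (Unicode Normalization Form D, canonical decomposition)
splits precomposed characters into a base character plus combining marks and
sorts combining marks by canonical combining class; Python's unicodedata
module can serve as an oracle for the expected results:

```python
import unicodedata

# U+00E9 (e with acute) decomposes to U+0065 (e) + U+0301 (combining acute):
assert unicodedata.normalize('NFD', '\u00e9') == 'e\u0301'

# Combining marks are sorted by canonical combining class:
# U+0316 (ccc 220) must come before U+0301 (ccc 230).
assert unicodedata.normalize('NFD', 'a\u0301\u0316') == 'a\u0316\u0301'
```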

Known issues:
- I'm doing an extra hash_table lookup on every identifier lookup.
  I haven't yet measured the impact on the speed of the compiler.
  If this is an issue, is there a good place to stash an extra
  pointer in every identifier?
- doesn't yet bootstrap, as the confusables.txt data contains ASCII
  to ASCII confusables, leading to warnings such as:
../../.././gcc/options.h:11273:3: warning: identifier ‘OPT_l’... [CWE-1007] [-Whomoglyph]
../../.././gcc/options.h:9959:3: note: ...confusable with non-equal identifier ‘OPT_I’ (‘OPT_I’) here
  Perhaps the option should have levels, where we don't complain about
  pure ASCII confusables at the default level?
- no docs yet
- as noted above the location_t of the token isn't quite right
- map_identifier_to_skeleton and map_skeleton_to_first_use aren't
  yet integrated with the garbage collector
- some other FIXMEs in the patch

[I had to trim the patch for space to get it to pass the size filter on the
mailing list; I trimmed:
  contrib/unicode/confusables.txt, 
  gcc/testsuite/selftests/NormalizationTest.txt
which can be downloaded from the URLs in the ChangeLog, and:
  gcc/confusables.inc
  gcc/decomposition.inc
which can be generated using the scripts in the patch ]

Thoughts?
Dave

contrib/ChangeLog:
	PR preprocessor/103027
	* unicode/confusables.txt: New file, downloaded from
	https://www.unicode.org/Public/security/13.0.0/confusables.txt
	* unicode/gen-confusables-inc.py: New file.
	* unicode/gen-decomposition-inc.py: New file.

gcc/ChangeLog:
	PR preprocessor/103027
	* common.opt (Whomoglyph): New.
	* confusables.inc: New file, generated by
	contrib/unicode/gen-confusables-inc.py.
	* decomposition.inc: New file, generated by
	contrib/unicode/gen-decomposition-inc.py.
	* selftest-run-tests.c (selftest::run_tests): Call
	stringpool_c_tests.
	* selftest.c (note_formatted): New.
	* selftest.h (note_formatted): New decl.
	(stringpool_c_tests): New decl.
	* stringpool.c: Include "cpplib.h", "pretty-print.h",
	"selftest.h", "diagnostic.h", "diagnostic-metadata.h",
	"gcc-rich-location.h"
	(init_stringpool): Set on_new_node and on_existing_node callbacks.
	(print_escaped_codepoint): New.
	(parse_hex): New.
	(class utf32_string): New.
	(class utf8_string): New.
	(utf32_string::append_decomposition): New.
	(utf32_string::convert_to_nfd): New.
	(cmp_combining_class): New.
	(utf32_string::sort_by_combining_class): New.
	(convert_homoglyphs_to_exemplars): New.
	(get_tr39_skeleton): New.
	(struct first_use): New.
	(map_identifier_to_skeleton): New.
	(map_skeleton_to_first_use): New.
	(class escaped_identifier): New.
	(maybe_warn_on_homoglyph): New.
	(stringpool_on_new_hashnode): New.
	(stringpool_on_existing_hashnode): New.
	(selftest::assert_dump_eq): New.
	(ASSERT_DUMP_EQ): New.
	(selftest::test_utf32_from_utf8): New.
	(selftest::test_from_hex): New.
	(selftest::test_combining_classes): New.
	(selftest::assert_utf32_eq_at): New.
	(ASSERT_UTF32_EQ_AT): New.
	(selftest::test_normalization_line): New.
	(selftest::test_normalization): New.
	(selftest::test_tr39_skeleton_1): New.
	(selftest::test_tr39_skeleton_2): New.
	(selftest::test_tr39_skeleton_3): New.
	(selftest::test_tr39_skeleton_4): New.
	(selftest::test_tr39_skeleton_5): New.
	(selftest::stringpool_c_tests): New.

gcc/testsuite/ChangeLog:
	PR preprocessor/103027
	* c-c++-common/Whomoglyph-1.c: New test.
	* c-c++-common/Whomoglyph-2.c: New test.
	* c-c++-common/Whomoglyph-3.c: New test.
	* selftests/NormalizationTest.txt: New file, downloaded from
	https://www.unicode.org/Public/13.0.0/ucd/NormalizationTest.txt

libcpp/ChangeLog:
	PR preprocessor/103027
	* charset.c (cpp_combining_class): New.
	* include/cpplib.h (cpp_warning_at): New decl.
	(cpp_combining_class): New decl.
	* include/symtab.h (ht::on_new_node): New callback field.
	(ht::on_existing_node): New callback field.
	* symtab.c (ht_lookup_with_hash): Call the new callbacks.

Signed-off-by: David Malcolm <dmalcolm@redhat.com>
---
 contrib/unicode/confusables.txt               |  9638 +++++
 contrib/unicode/gen-confusables-inc.py        |   120 +
 contrib/unicode/gen-decomposition-inc.py      |    41 +
 gcc/common.opt                                |     4 +
 gcc/confusables.inc                           | 34266 ++++++++++++++++
 gcc/decomposition.inc                         | 11337 +++++
 gcc/selftest-run-tests.c                      |     1 +
 gcc/selftest.c                                |    16 +
 gcc/selftest.h                                |     7 +
 gcc/stringpool.c                              |  1199 +
 gcc/testsuite/c-c++-common/Whomoglyph-1.c     |    18 +
 gcc/testsuite/c-c++-common/Whomoglyph-2.c     |    18 +
 gcc/testsuite/c-c++-common/Whomoglyph-3.c     |    20 +
 gcc/testsuite/selftests/NormalizationTest.txt | 18908 +++++++++
 libcpp/charset.c                              |    23 +
 libcpp/include/cpplib.h                       |     5 +
 libcpp/include/symtab.h                       |     4 +
 libcpp/symtab.c                               |    15 +-
 18 files changed, 75638 insertions(+), 2 deletions(-)
 create mode 100644 contrib/unicode/confusables.txt
 create mode 100644 contrib/unicode/gen-confusables-inc.py
 create mode 100644 contrib/unicode/gen-decomposition-inc.py
 create mode 100644 gcc/confusables.inc
 create mode 100644 gcc/decomposition.inc
 create mode 100644 gcc/testsuite/c-c++-common/Whomoglyph-1.c
 create mode 100644 gcc/testsuite/c-c++-common/Whomoglyph-2.c
 create mode 100644 gcc/testsuite/c-c++-common/Whomoglyph-3.c
 create mode 100644 gcc/testsuite/selftests/NormalizationTest.txt
  

Comments

Jakub Jelinek Nov. 2, 2021, 11:56 a.m. UTC | #1
On Mon, Nov 01, 2021 at 05:14:12PM -0400, David Malcolm via Gcc-patches wrote:
> Thoughts?

Resending my previously internally posted mail:

Thanks for working on this, but I'm afraid this is done at the wrong
location.

Consider attached testcases Whomoglyph1.C and Whomoglyph2.C.
On the Whomoglyph1.C testcase, I'd expect a warning, because there is clear
confusion for the reader, something that isn't visible in any of the emacs,
vim or joe editors or on the terminal: when f3 uses the scope identifier,
the casual reader will expect that it uses N1::N2::scope, but there is no
such variable, only N1::N2::ѕсоре, which visually looks the same but has
different UTF-8 chars in it.  So, name lookup will instead find N1::scope
and use that.
But Whomoglyph2.C will emit warnings that are IMHO not appropriate; I
believe there is no confusion at all there.  E.g., in the f5/f6 case, for
both C and C++, it doesn't really matter how each function names its own
parameter; one can never access another function's parameter.  Ditto for
different namespaces, provided that both namespaces aren't searched in the
same name lookup, and similarly for classes etc.
So, IMNSHO that warning belongs in name lookup (cp/name-lookup.c for the C++
FE).
And, another important thing is that most users don't really use Unicode in
identifiers; I bet over 99.9% of identifiers don't have any >= 0x80
characters in them, and even when people do use them, confusable identifiers
during the same lookup are far more unlikely still.
So, I think we should optimize for the common case, ASCII-only identifiers,
and spend as little compile time as possible on this stuff.

Another thing we've discussed on IRC is that the Unicode confusables.txt has
some canonicalizations that I think most users will not appreciate, or at
least not appreciate being warned about by default.  In particular, the
canonicalizations from ASCII to ASCII:
0 -> O
1 -> l
I -> l
m -> rn
I think most users use monospace fonts when working on C/C++ etc. source
code, anything else ruins indentation (ok, except for projects that indent
everything by tabs only).  In monospace fonts, I think m can't be confused
with rn.  And in most monospace fonts, 0 and O and 1/l/I are also fairly
well distinguishable.  So, I think we should warn about f00 vs. fOO or
home vs. horne or f1 vs. fl vs. fI only with -Whomoglyph=2, a non-default
extra level, rather than by default.
But, given the way confusables.txt is generated, while it is probably
easy to undo the m -> rn canonicalizations (assume everything that maps
to rn actually maps to m), for the 0 vs. O or 1 vs. l vs. I cases I'm afraid
trying to differentiate what a character actually maps to will be harder.
So, my proposal would be, by default, not to warn for the 0 vs. O, 1 vs. l
vs. I and m vs. rn differences if neither of the identifiers contains any
UTF-8 chars, and to warn otherwise.

To spend as little compile time as possible on this, I think libcpp
knows well whether a UTF-8 or UCN character has been encountered in an
identifier (it uses the libcpp/charset.c entrypoints then), so perhaps we
could mark the tokens with a new flag that stands for "identifier contains
at least one UTF-8 (non-ASCII) or UCN character".  Or, failing that, perhaps
just quickly scanning the identifier for characters with the MSB set might
be tolerable too.
So much for how to identify the identifiers for which we should compute the
canonical homoglyph form at all.
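
The gating described above could be sketched roughly as follows (a
hypothetical Python model, not GCC code; the names warrants_skeleton_check
and level are illustrative):

```python
def has_non_ascii(ident: bytes) -> bool:
    # Quick scan for bytes with the MSB set (UTF-8 lead/continuation bytes):
    return any(b & 0x80 for b in ident)

def warrants_skeleton_check(id_a: bytes, id_b: bytes, level: int = 1) -> bool:
    """At the default level, skip pure-ASCII pairs entirely; a hypothetical
       -Whomoglyph=2 would also consider the ASCII-to-ASCII confusables
       (0/O, 1/l/I, m/rn)."""
    if level >= 2:
        return True
    return has_non_ascii(id_a) or has_non_ascii(id_b)

assert not warrants_skeleton_check(b'home', b'horne')        # pure ASCII pair
assert warrants_skeleton_check('say\u041dello'.encode(), b'sayHello')
assert warrants_skeleton_check(b'f1', b'fl', level=2)
```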
As for how to store it: again, as we should IMHO optimize for the usual case
where even identifiers with UTF-8/UCN characters canonicalize to themselves,
I think the best representation would be not to waste a whole pointer on
each IDENTIFIER_NODE for it, but instead to reserve one bit on the
identifiers, "identifier has a canonical homoglyph version different from
itself"; that bit would mean we have some hash map etc. on the side that
maps the identifier to its canonical form.
As for the actual uses of the confusables.txt transformations, do you think
we need to work on UTF-32?  Can't it all be done on UTF-8 instead?  Let
whatever confusables.txt -> something.h generator we write prepare a
decision function that uses UTF-8 both in what to replace and, certainly,
just a UTF-8 string literal in what to replace it with.
Note, the above would mean we don't compute them for those 0 -> O, [1I] -> l
or m -> rn canonicalizations for the default -Whomoglyph mode.
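
Doing the replacement directly on UTF-8, as suggested above, might look
roughly like this (a hypothetical sketch; the two table entries are
illustrative, and the NFD step is omitted):

```python
# Both the keys and the replacements are UTF-8 byte strings:
UTF8_CONFUSABLES = {
    '\u041d'.encode('utf-8'): b'H',   # CYRILLIC CAPITAL LETTER EN
    '\u0455'.encode('utf-8'): b's',   # CYRILLIC SMALL LETTER DZE
}

def map_confusables_utf8(ident: bytes) -> bytes:
    """Replace confusable sequences with no UTF-32 round-trip."""
    out = bytearray()
    i = 0
    while i < len(ident):
        for src, dst in UTF8_CONFUSABLES.items():
            if ident.startswith(src, i):
                out += dst
                i += len(src)
                break
        else:
            out.append(ident[i])   # ASCII or unmapped byte: copy as-is
            i += 1
    return bytes(out)

assert map_confusables_utf8('say\u041dello'.encode()) == b'sayHello'
```

A real generator would presumably emit this decision logic as a C function
or a trie over the UTF-8 bytes rather than a dictionary scan.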
Perhaps we can use some bit/flag on the C FE scopes or C++ namespaces
and on the identifiers we map to through the hash map.  On the scopes etc.
it would mean this scope has some identifiers in it that have homoglyph
alternatives, and that the homoglyph canonical forms have already been added
wherever name lookup can find them.  On the canonical forms it would perhaps
mean this canonical form has some O, l or rn sequences in it.
Then, during actual name lookup: if the current identifier doesn't have the
alt-homoglyph-canonicalization flag set and the scope doesn't have the new
bit set either, there would be nothing further to do; for a name lookup
where the identifier we search for has the new bit set and the scope has it
too, we'd do two lookups, one normal and one for the alt homoglyph canonical
form; and if the scope doesn't have the new bit set, we'd perform whatever
is needed to figure that out.
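
The flow of that scheme might be modelled roughly like this (a heavily
simplified, hypothetical Python sketch; Scope, bind and canon_index are
illustrative names, not actual front-end data structures):

```python
class Scope:
    def __init__(self):
        self.bindings = {}          # identifier -> declaration
        self.canon_index = {}       # canonical homoglyph form -> identifier
        self.has_homoglyph_alts = False   # the proposed per-scope bit

def bind(scope, ident, decl, canonical=None):
    """Bind IDENT; if it has a distinct canonical homoglyph form, index it."""
    scope.bindings[ident] = decl
    if canonical is not None and canonical != ident:
        scope.canon_index[canonical] = ident
        scope.has_homoglyph_alts = True

def lookup(scope, ident, canonical=None):
    # Fast path: unflagged identifier, scope has no homoglyph alternatives.
    if canonical is None and not scope.has_homoglyph_alts:
        return scope.bindings.get(ident)
    # Slow path: normal lookup, plus a lookup of the canonical form.
    hit = scope.bindings.get(ident)
    if hit is None:
        other = scope.canon_index.get(canonical or ident)
        if other is not None:
            print("warning: %r is confusable with %r" % (ident, other))
    return hit

sc = Scope()
cyr = '\u0455\u0441\u043e\u0440\u0435'   # Cyrillic, looks like 'scope'
bind(sc, cyr, 'decl', canonical='scope')
assert lookup(sc, 'scope') is None        # warns: confusable with cyr
```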

	Jakub
// { dg-do compile }
// { dg-options "-Whomoglyph" }

namespace N1
{
  int scope; // ASCII
  int f1 () { return scope++; } // ASCII
  namespace N2
  {
    int ѕсоре; // Cyrillic
    int f2 () { return ѕсоре++; } // Cyrillic
    int f3 () { return scope++; } // ASCII	{ dg-warning "Whomoglyph" }
  }
}
// { dg-do compile }
// { dg-options "-Whomoglyph" }

namespace N1
{
  int scope; // ASCII
  int f1 () { return scope++; } // ASCII
}
namespace N2
{
  int ѕсоре; // Cyrillic	{ dg-bogus "Whomoglyph" }
  int f2 () { return ѕсоре++; } // Cyrillic
}
struct S1
{
  constexpr static int Hello = 1; // ASCII
  int f3 () { return Hello; } // ASCII
};
struct S2
{
  constexpr static int Ηello = 2; // Greek+ASCII	{ dg-bogus "Whomoglyph" }
  int f4 () { return Ηello; } // Greek+ASCII
};
int f5 (int s) // ASCII
{
  return s; // ASCII
}
int f6 (int ѕ) // Cyrillic	{ dg-bogus "Whomoglyph" }
{
  return ѕ; // Cyrillic
}
  
Jakub Jelinek Nov. 2, 2021, 12:06 p.m. UTC | #2
On Tue, Nov 02, 2021 at 12:56:53PM +0100, Jakub Jelinek wrote:
> Consider attached testcases Whomoglyph1.C and Whomoglyph2.C.
> On Whomoglyph1.C testcase, I'd expect a warning, because there is a clear
> confusion for the reader, something that isn't visible in any of emacs, vim,
> joe editors or on the terminal, when f3 uses scope identifier, the casual
> reader will expect that it uses N1::N2::scope, but there is no such
> variable, only one N1::N2::ѕсоре that visually looks the same, but has
> different UTF-8 chars in it.  So, name lookup will instead find N1::scope
> and use that.
> But Whomoglyph2.C will emit warnings that are IMHO not appropriate,
> I believe there is no confusion at all there, e.g. for both C and C++,
> the f5/f6 case, it doesn't really matter how each of the function names its
> own parameter, one can never access another function's parameter.
> Ditto for different namespace provided that both namespaces aren't searched
> in the same name lookup, or similarly classes etc.
> So, IMNSHO that warning belongs to name-lookup (cp/name-lookup.c for the C++
> FE).
> And, another important thing is that most users don't really use unicode in
> identifiers, I bet over 99.9% of identifiers don't have any >= 0x80
> characters in it and even when people do use them, confusable identifiers
> during the same lookup are even far more unlikely.
> So, I think we should optimize for the common case, ASCII only identifiers
> and spend as little compile time as possible on this stuff.

If we keep doing it in the stringpool, then e.g. one couldn't
#include <zlib.h>
in a program with Russian/Ukrainian/Serbian etc. identifiers where some parameter
or automatic variable etc. in some function in that file is called
с (Cyrillic letter es), etc. just because in zlib.h one of the arguments
in one of the function prototypes is called c (latin small letter c).
I'd be afraid that most of the users who actually want to use UTF-8 or UCNs
in their identifiers would then just have to disable this warning...

	Jakub
  
Martin Sebor Nov. 2, 2021, 7:49 p.m. UTC | #3
On 11/1/21 3:14 PM, David Malcolm via Gcc-patches wrote:
> [Resending to get around mailing list size limit; see notes below]
> 
> This patch implements a new -Whomoglyph diagnostic, enabled by default.
> 
> Internally it implements the "skeleton" algorithm from:
>    http://www.unicode.org/reports/tr39/#Confusable_Detection
> so that every new identifier is mapped to a "skeleton", and if
> the skeleton is already in use by a different identifier, issue
> a -Whomoglyph diagnostic.
> It uses the data from:
>    https://www.unicode.org/Public/security/13.0.0/confusables.txt
> to determine which characters are confusable.
> 
> For example, given the example of CVE-2021-42694 at
> https://trojansource.codes/, with this patch we emit:
> 
> t.cc:7:1: warning: identifier ‘sayНello’ (‘say\u041dello’)... [CWE-1007] [-Whomoglyph]
>      7 | void say<U+041D>ello() {
>        | ^~~~
> t.cc:3:1: note: ...confusable with non-equal identifier ‘sayHello’ here
>      3 | void sayHello() {
>        | ^~~~
> 
> (the precise location of the token isn't quite right; the
> identifiers should be underlined, rather than the "void" tokens)
> 
> This takes advantage of:
>    "diagnostics: escape non-ASCII source bytes for certain diagnostics"
>      https://gcc.gnu.org/pipermail/gcc-patches/2021-November/583020.html
> to escape non-ASCII characters when printing a source line for -Whomoglyph,
> so that we print "say<U+041D>ello" when quoting the source line, making it
> clearer that this is not "sayHello".
> 
> In order to implement "skeleton", I had to implement NFD support, so the
> patch also contains some UTF-32 support code.
> 
> Known issues:
> - I'm doing an extra hash_table lookup on every identifier lookup.
>    I haven't yet measured the impact on the speed of the compiler.
>    If this is an issue, is there a good place to stash an extra
>    pointer in every identifier?
> - doesn't yet bootstrap, as the confusables.txt data contains ASCII
>    to ASCII confusables, leading to warnings such as:
> ../../.././gcc/options.h:11273:3: warning: identifier ‘OPT_l’... [CWE-1007] [-Whomoglyph]
> ../../.././gcc/options.h:9959:3: note: ...confusable with non-equal identifier ‘OPT_I’ (‘OPT_I’) here
>    Perhaps the option should have levels, where we don't complain about
>    pure ASCII confusables at the default level?
> - no docs yet
> - as noted above the location_t of the token isn't quite right
> - map_identifier_to_skeleton and map_skeleton_to_first_use aren't
>    yet integrated with the garbage collector
> - some other FIXMEs in the patch
> 
> [I had to trim the patch for space to get it to pass the size filter on the
> mailing list; I trimmed:
>    contrib/unicode/confusables.txt,
>    gcc/testsuite/selftests/NormalizationTest.txt
> which can be downloaded from the URLs in the ChangeLog, and:
>    gcc/confusables.inc
>    gcc/decomposition.inc
> which can be generated using the scripts in the patch ]
> 
> Thoughts?

None from me on the actual feature -- even after our discussion
this morning I remain comfortably ignorant of the problem :)
I just have a quick comment on the two new string classes:

...
> +
> +/* A class for manipulating UTF-32 strings.  */
> +
> +class utf32_string
> +{
...
> + private:
...
> +  cppchar_t *m_buf;
> +  size_t m_alloc_len;
> +  size_t m_len;
> +};
> +
> +/* A class for constructing UTF-8 encoded strings.
> +   These are not NUL-terminated.  */
> +
> +class utf8_string
> +{
...
> + private:
> +  uchar  *m_buf;
> +  size_t m_alloc_sz;
> +  size_t m_len;
> +};

There are container abstractions both in C++ and in GCC that
these classes look like they could be implemented in terms of:
I'm thinking of std::string, std::vector, vec, and auto_vec.
They have the additional advantage of being safely copyable
and assignable, and of course, of having already been tested.
I see that the classes in your patch provide additional
functionality that the abstractions don't.  I'd expect it
be doable on top of the abstractions and without
reimplementing all the basic buffer management.

Martin
  

Patch

diff --git a/contrib/unicode/gen-confusables-inc.py b/contrib/unicode/gen-confusables-inc.py
new file mode 100644
index 00000000000..a19117ee6de
--- /dev/null
+++ b/contrib/unicode/gen-confusables-inc.py
@@ -0,0 +1,120 @@ 
+#!/usr/bin/env python3
+#
+# Script to generate confusables.inc from confusables.txt
+#
+# This file is part of GCC.
+#
+# GCC is free software; you can redistribute it and/or modify it under
+# the terms of the GNU General Public License as published by the Free
+# Software Foundation; either version 3, or (at your option) any later
+# version.
+#
+# GCC is distributed in the hope that it will be useful, but WITHOUT ANY
+# WARRANTY; without even the implied warranty of MERCHANTABILITY or
+# FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+# for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with GCC; see the file COPYING3.  If not see
+# <http://www.gnu.org/licenses/>.
+
+#from pprint import pprint
+from collections import namedtuple
+import re
+import sys
+
+class Confusable(namedtuple('Confusable', ['src_chr', 'dst_chrs', 'comment'])):
+    pass
+
+class Confusables:
+    def __init__(self):
+        self.date = None
+        self.version = None
+        self.items = []
+
+    @staticmethod
+    def from_stream(f_in):
+        c = Confusables()
+        for line in f_in.readlines():
+            c.parse_line(line)
+        return c
+
+    def parse_line(self, line):
+        """
+        Parse a line of confusables.txt as specified in
+        http://www.unicode.org/reports/tr39/#Confusable_Detection
+        """
+        #print(repr(line))
+        m = re.match(r'# Date: (.+)', line)
+        if m:
+            self.date = m.group(1)
+            return
+        hexdigits_group = r'([0-9A-F]+)'
+        hexdigits_groups = r'([0-9A-F ]+)'
+        opt_ws = r'\s*'
+        pattern = ('^' + hexdigits_group + opt_ws + ';'
+                   + opt_ws + hexdigits_groups + opt_ws + ';'
+                   + opt_ws + 'MA' + opt_ws + '#(.*)$')
+        m = re.match(pattern, line)
+        if m:
+            # Convert from hex-encoded codepoint to a unicode character:
+            src_chr = chr(int(m.group(1), 16))
+
+            # Convert from space-separated hex-encoded codepoints to a
+            # unicode string:
+            dst_chrs = ''.join([chr(int(hstr, 16))
+                                for hstr in m.group(2).split()])
+            comment = m.group(3)
+            self.items.append(Confusable(src_chr, dst_chrs, comment))
+            #print(repr(self.items[-1]))
+            return
+        # Verify that we parsed all items
+        m = re.match(r'# total: ([0-9]+)', line)
+        if m:
+            total = int(m.group(1))
+            if total != len(self.items):
+                raise ValueError('mismatching total: %i; len(self.items): %i'
+                                 % (total, len(self.items)))
+
+    def write_as_inc(self, f_out):
+        f_out.write('/* Generated from unicode confusables.txt\n'
+                    '   Date: %s\n'
+                    '   Version: %s */\n' % (self.date, self.version))
+        f_out.write('/* Define the following macros:\n'
+                    '     BEGIN_SRC(CODEPOINT) : a handler for the codepoint.\n'
+                    '     DST(CODEPOINT): used to emit each destination codepoint.\n'
+                    '     END_SRC: finish this src codepoint.\n  */')
+        for item in self.items:
+            f_out.write('/* %s */\n' % item.comment)
+
+            # Verify that the only remappings we see from ASCII are the
+            # ones we expect
+            if (ord(item.src_chr) < 0x80
+                # U+0022 QUOTATION MARK
+                and ord(item.src_chr) != 0x22
+                # U+0025 PERCENT SIGN
+                and ord(item.src_chr) != 0x25
+                # U+0030 DIGIT ZERO
+                and ord(item.src_chr) != 0x30
+                # U+0031 DIGIT ONE
+                and ord(item.src_chr) != 0x31
+                # U+0049 LATIN CAPITAL LETTER I
+                and ord(item.src_chr) != 0x49
+                # U+0060 GRAVE ACCENT
+                and ord(item.src_chr) != 0x60
+                # U+006D LATIN SMALL LETTER M
+                and ord(item.src_chr) != 0x6d
+                # U+007C VERTICAL LINE
+                and ord(item.src_chr) != 0x7c):
+                raise ValueError('unexpected remapping from ASCII (0x%x)\n'
+                                 % (ord(item.src_chr)))
+
+            f_out.write('BEGIN_SRC (0x%x)\n' % (ord(item.src_chr)))
+            for dst_chr in item.dst_chrs:
+                f_out.write('  DST (0x%x);\n' % ord(dst_chr))
+            f_out.write('END_SRC\n\n')
+
+with open('confusables.txt') as f_in:
+    c = Confusables.from_stream(f_in)
+with open('../../gcc/confusables.inc', 'w') as f_out:
+    c.write_as_inc(f_out)
diff --git a/contrib/unicode/gen-decomposition-inc.py b/contrib/unicode/gen-decomposition-inc.py
new file mode 100644
index 00000000000..a50771dfd9d
--- /dev/null
+++ b/contrib/unicode/gen-decomposition-inc.py
@@ -0,0 +1,41 @@ 
+from pprint import pprint
+import re
+
+from from_glibc.unicode_utils import fill_attributes, UNICODE_ATTRIBUTES
+
+def parse_decomp(buf):
+    result = []
+    for hexdigits_group in buf.split():
+        ch = chr(int(hexdigits_group, 16))
+        result.append(ch)
+    return result
+
+fill_attributes('UnicodeData.txt')
+
+with open('../../gcc/decomposition.inc', 'w') as f_out:
+    f_out.write('/* Generated from UnicodeData.txt\n'
+                '   Define the following macros:\n'
+                '     BEGIN_SRC(CODEPOINT) : a handler for the codepoint.\n'
+                '     DST(CODEPOINT): used to emit each destination codepoint.\n'
+                '     END_SRC: finish this src codepoint.  */\n\n')
+
+    for codepoint in UNICODE_ATTRIBUTES:
+        attribs = UNICODE_ATTRIBUTES[codepoint]
+        if 0:
+            pprint(attribs)
+        if attribs['decomposition']:
+            #print('%r: %r: %r' % (codepoint, attribs['name'], attribs['decomposition']))
+            m = re.match(r'(<[a-zA-Z]+>)(.*)', attribs['decomposition'])
+            if m:
+                # print('has tag: %r' % m.group(1))
+                #compat_decomp = parse_decomp(m.group(2))
+                # Tagged decomposition mappings are compatibility; skip these
+                continue
+            else:
+                # Untagged decomposition mappings are canonical; write these
+                decomp = parse_decomp(attribs['decomposition'])
+            f_out.write('/* %s */\n' % attribs['name'])
+            f_out.write('BEGIN_SRC (0x%x)\n' % codepoint)
+            for dst_chr in decomp:
+                f_out.write('  DST (0x%x);\n' % ord(dst_chr))
+            f_out.write('END_SRC\n\n')
diff --git a/gcc/common.opt b/gcc/common.opt
index 1a5b9bfcca9..5cc73be0513 100644
--- a/gcc/common.opt
+++ b/gcc/common.opt
@@ -618,6 +618,10 @@  Wfree-nonheap-object
 Common Var(warn_free_nonheap_object) Init(1) Warning
 Warn when attempting to free a non-heap object.
 
+Whomoglyph
+Common Var(warn_homoglyph) Init(1) Warning
+Warn when an identifier has the same appearance as a different identifier.
+
 Whsa
 Common Ignore Warning
 Does nothing.  Preserved for backward compatibility.
diff --git a/gcc/selftest-run-tests.c b/gcc/selftest-run-tests.c
index 6a8f291f5dd..0eaff1afb08 100644
--- a/gcc/selftest-run-tests.c
+++ b/gcc/selftest-run-tests.c
@@ -80,6 +80,7 @@  selftest::run_tests ()
   opt_problem_cc_tests ();
   ordered_hash_map_tests_cc_tests ();
   splay_tree_cc_tests ();
+  stringpool_c_tests ();
 
   /* Mid-level data structures.  */
   input_c_tests ();
diff --git a/gcc/selftest.c b/gcc/selftest.c
index 13ec4f0f61c..07625c1be7e 100644
--- a/gcc/selftest.c
+++ b/gcc/selftest.c
@@ -63,6 +63,22 @@  fail_formatted (const location &loc, const char *fmt, ...)
   abort ();
 }
 
+/* As "fail_formatted", but do not abort.
+   Use this *before* failing to provide extra information.  */
+
+void
+note_formatted (const location &loc, const char *fmt, ...)
+{
+  va_list ap;
+
+  fprintf (stderr, "%s:%i: %s: NOTE: ", loc.m_file, loc.m_line,
+	   loc.m_function);
+  va_start (ap, fmt);
+  vfprintf (stderr, fmt, ap);
+  va_end (ap);
+  fprintf (stderr, "\n");
+}
+
 /* Implementation detail of ASSERT_STREQ.
    Compare val1 and val2 with strcmp.  They ought
    to be non-NULL; fail gracefully if either or both are NULL.  */
diff --git a/gcc/selftest.h b/gcc/selftest.h
index 24ef57cb6cc..2e069e3e1e4 100644
--- a/gcc/selftest.h
+++ b/gcc/selftest.h
@@ -65,6 +65,12 @@  extern void fail (const location &loc, const char *msg)
 extern void fail_formatted (const location &loc, const char *fmt, ...)
   ATTRIBUTE_PRINTF_2 ATTRIBUTE_NORETURN;
 
+/* As "fail_formatted", but do not abort.
+   Use this *before* failing to provide extra information.  */
+
+extern void note_formatted (const location &loc, const char *fmt, ...)
+  ATTRIBUTE_PRINTF_2;
+
 /* Implementation detail of ASSERT_STREQ.  */
 
 extern void assert_streq (const location &loc,
@@ -257,6 +263,7 @@  extern void spellcheck_tree_c_tests ();
 extern void splay_tree_cc_tests ();
 extern void sreal_c_tests ();
 extern void store_merging_c_tests ();
+extern void stringpool_c_tests ();
 extern void tree_c_tests ();
 extern void tree_cfg_c_tests ();
 extern void tree_diagnostic_path_cc_tests ();
diff --git a/gcc/stringpool.c b/gcc/stringpool.c
index 2f21466af24..843701310f2 100644
--- a/gcc/stringpool.c
+++ b/gcc/stringpool.c
@@ -29,11 +29,20 @@  along with GCC; see the file COPYING3.  If not see
 #include "system.h"
 #include "coretypes.h"
 #include "tree.h"
+#include "cpplib.h"
+#include "pretty-print.h"
+#include "selftest.h"
+#include "diagnostic.h"
+#include "diagnostic-metadata.h"
+#include "gcc-rich-location.h"
 
 struct ht *ident_hash;
 
 static hashnode alloc_node (cpp_hash_table *);
+static void stringpool_on_new_hashnode (hashnode);
+static void stringpool_on_existing_hashnode (hashnode);
 static int mark_ident (struct cpp_reader *, hashnode, const void *);
+static void maybe_warn_on_homoglyph (tree id);
 
 static void *
 stringpool_ggc_alloc (size_t x)
@@ -54,6 +63,8 @@  init_stringpool (void)
   ident_hash = ht_create (14);
   ident_hash->alloc_node = alloc_node;
   ident_hash->alloc_subobject = stringpool_ggc_alloc;
+  ident_hash->on_new_node = stringpool_on_new_hashnode;
+  ident_hash->on_existing_node = stringpool_on_existing_hashnode;
 }
 
 /* Allocate a hash node.  */
@@ -83,6 +94,557 @@  ggc_alloc_string (const char *contents, int length MEM_STAT_DECL)
   return (const char *) result;
 }
 
+/* Print CH to PP; if non-ASCII or non-printable, escape it as \uXXXX
+   or \UXXXXXXXX where X are hexadecimal digits.  */
+
+static void
+print_escaped_codepoint (pretty_printer *pp, cppchar_t ch)
+{
+  if (ch < 0x80 && ISPRINT (ch))
+    pp_character (pp, ch);
+  else if (ch <= 0xffff)
+    {
+      pp_string (pp, "\\u");
+      for (int j = 3; j >= 0; j--)
+	pp_character (pp, "0123456789abcdef"[(ch >> (4 * j)) & 0xF]);
+    }
+  else
+    {
+      pp_string (pp, "\\U");
+      for (int j = 7; j >= 0; j--)
+	pp_character (pp, "0123456789abcdef"[(ch >> (4 * j)) & 0xF]);
+    }
+}
+
+/* Attempt to parse a hexadecimal number from BUF of up to length LEN,
+   skipping leading non-digits, and potentially skipping trailing non-digits.
+   This is the format within columns of
+   https://www.unicode.org/Public/UNIDATA/NormalizationTest.txt
+
+   Return the number of bytes consumed.
+   Write the result (if any) to *OUT.  */
+
+static size_t
+parse_hex (const char *buf, size_t len, cppchar_t *out)
+{
+  size_t chars_consumed = 0;
+  cppchar_t ch = 0;
+  unsigned num_digits = 0;
+  while (len > 0)
+    {
+      unsigned digit = 0;
+      switch (buf[0])
+	{
+	default:
+	  /* Not a hex digit.
+	     Skip until we see a hex digit, then consume digits.  */
+	  buf++;
+	  chars_consumed++;
+	  len--;
+	  if (num_digits == 0)
+	    continue;
+	  else
+	    /* First non-hex-digit after a hex digit.
+	       Terminate.  */
+	    {
+	      *out = ch;
+	      return chars_consumed;
+	    }
+
+	/* Hex digits.  */
+	case '0': digit = 0; break;
+	case '1': digit = 1; break;
+	case '2': digit = 2; break;
+	case '3': digit = 3; break;
+	case '4': digit = 4; break;
+	case '5': digit = 5; break;
+	case '6': digit = 6; break;
+	case '7': digit = 7; break;
+	case '8': digit = 8; break;
+	case '9': digit = 9; break;
+	case 'a': case 'A': digit = 10; break;
+	case 'b': case 'B': digit = 11; break;
+	case 'c': case 'C': digit = 12; break;
+	case 'd': case 'D': digit = 13; break;
+	case 'e': case 'E': digit = 14; break;
+	case 'f': case 'F': digit = 15; break;
+	}
+      /* Handling hex digits.  */
+      ch <<= 4;
+      ch += digit;
+      num_digits++;
+      buf++;
+      chars_consumed++;
+      len--;
+    }
+  *out = ch;
+  return chars_consumed;
+}
+
+/* A class for manipulating UTF-32 strings.  */
+
+class utf32_string
+{
+ public:
+  /* Construct an empty utf32_string, preallocated to hold ALLOC_LEN
+     codepoints.  */
+  utf32_string (size_t alloc_len)
+  : m_buf ((cppchar_t *)xmalloc (alloc_len * sizeof (cppchar_t))),
+    m_alloc_len (alloc_len),
+    m_len (0)
+  {
+  }
+
+  ~utf32_string () { free (m_buf); }
+
+  size_t get_length () const { return m_len; }
+
+  cppchar_t operator[] (size_t idx) const
+  {
+    gcc_assert (idx < m_len);
+    return m_buf[idx];
+  }
+
+  bool operator== (const utf32_string &other) const
+  {
+    if (m_len != other.m_len)
+      return false;
+    return memcmp (m_buf, other.m_buf, m_len * sizeof (cppchar_t)) == 0;
+  }
+
+  bool operator!= (const utf32_string &other) const
+  {
+    return !(*this == other);
+  }
+
+  void dump_to_pp (pretty_printer *pp) const
+  {
+    pp_character (pp, '"');
+    for (size_t idx = 0; idx < m_len; idx++)
+      print_escaped_codepoint (pp, m_buf[idx]);
+    pp_character (pp, '"');
+    // TODO: length, alloc sz, etc
+  }
+
+  DEBUG_FUNCTION void dump () const
+  {
+    pretty_printer pp;
+    pp.buffer->stream = stderr;
+    dump_to_pp (&pp);
+    pp_newline (&pp);
+    pp_flush (&pp);
+  }
+
+  static utf32_string
+  from_identifier (tree id)
+  {
+   return from_utf8 ((const uchar *)IDENTIFIER_POINTER (id),
+		     IDENTIFIER_LENGTH (id));
+  }
+
+  static utf32_string
+  from_utf8 (const char *utf8)
+  {
+    return from_utf8 ((const unsigned char *)utf8, strlen (utf8));
+  }
+
+  static utf32_string
+  from_utf8 (const unsigned char *buf, size_t len)
+  {
+    utf32_string result (len);
+    size_t idx = 0;
+    while (idx < len)
+      {
+	/* Compute the length of the src UTF-8 codepoint.  */
+	int ucn_len = 0;
+	uchar ch = buf[idx++];
+	for (uchar t = ch; t & 0x80; t <<= 1)
+	  ucn_len++;
+
+	cppchar_t src_utf32 = ch & (0x7F >> ucn_len);
+	for (int ucn_len_c = 1; ucn_len_c < ucn_len; ucn_len_c++)
+	  src_utf32 = (src_utf32 << 6) | (buf[idx++] & 0x3F);
+
+	result.append (src_utf32);
+      }
+    return result;
+  }
+
+  static utf32_string
+  from_cppchar_t (cppchar_t ch)
+  {
+    utf32_string result (1);
+    result.append (ch);
+    return result;
+  }
+
+  /* Convert from space-separated hex-encoded codepoints, as seen
+     in the columns of
+       https://www.unicode.org/Public/UNIDATA/NormalizationTest.txt
+     for example.  */
+
+  static utf32_string
+  from_hex (const char *buf, size_t buf_len)
+  {
+    utf32_string result (3);
+
+    while (buf_len > 0)
+    {
+      cppchar_t next_ch;
+      size_t consumed = parse_hex (buf, buf_len, &next_ch);
+      if (consumed > 0)
+	result.append (next_ch);
+      buf += consumed;
+      buf_len -= consumed;
+    }
+
+    return result;
+  }
+
+  void append (cppchar_t ch)
+  {
+    ensure_space (m_len + 1);
+    m_buf[m_len++] = ch;
+  }
+
+  /* Move ctor.  */
+
+  utf32_string (utf32_string &&other)
+  : m_buf (other.m_buf),
+    m_alloc_len (other.m_alloc_len),
+    m_len (other.m_len)
+  {
+    other.m_buf = NULL;
+  }
+
+  /* Move assignment.  */
+
+  utf32_string &operator= (utf32_string &&other)
+  {
+   free (m_buf);
+   m_buf = other.m_buf;
+   m_alloc_len = other.m_alloc_len;
+   m_len = other.m_len;
+   other.m_buf = NULL;
+   return *this;
+  }
+
+  utf32_string convert_to_nfd () const;
+
+ private:
+  void ensure_space (size_t new_len)
+  {
+    if (m_alloc_len < new_len)
+      {
+	m_alloc_len = new_len * 2;
+	m_buf = (cppchar_t *)xrealloc (m_buf, m_alloc_len * sizeof (cppchar_t));
+      }
+  }
+
+  void append_decomposition (cppchar_t ch);
+  void sort_by_combining_class (size_t start_idx, size_t end_idx);
+
+  cppchar_t *m_buf;
+  size_t m_alloc_len;
+  size_t m_len;
+};
+
+/* A class for constructing UTF-8 encoded strings.
+   These are not NUL-terminated.  */
+
+class utf8_string
+{
+ public:
+  utf8_string (size_t alloc_sz)
+  : m_buf ((uchar *)xmalloc (alloc_sz)), m_alloc_sz (alloc_sz), m_len (0)
+  {
+  }
+  utf8_string (const utf32_string &other)
+  {
+    m_alloc_sz = (other.get_length () * 6) + 1;
+    m_buf = (uchar *)xmalloc (m_alloc_sz);
+    m_len = 0;
+    for (size_t idx = 0; idx < other.get_length (); idx++)
+      {
+	cppchar_t c = other[idx];
+
+	/* Adapted from libcpp/charset.c: one_cppchar_to_utf8 */
+	if (c < 0x80)
+	  quick_append (c);
+	else
+	  {
+	    static const uchar masks[6]
+	      =  { 0x00, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC };
+	    static const uchar limits[6]
+	      = { 0x80, 0xE0, 0xF0, 0xF8, 0xFC, 0xFE };
+	    size_t nbytes = 1;
+	    uchar buf[6], *p = &buf[6];
+	    do
+	      {
+		*--p = ((c & 0x3F) | 0x80);
+		c >>= 6;
+		nbytes++;
+	      }
+	    while (c >= 0x3F || (c & limits[nbytes-1]));
+	    *--p = (c | masks[nbytes-1]);
+	    while (p < &buf[6])
+	      quick_append (*p++);
+	  }
+      }
+  }
+
+  ~utf8_string () { free (m_buf); }
+
+  /* Move ctor.  */
+
+  utf8_string (utf8_string &&other)
+  : m_buf (other.m_buf),
+    m_alloc_sz (other.m_alloc_sz),
+    m_len (other.m_len)
+  {
+    other.m_buf = NULL;
+  }
+
+  void dump_to_pp (pretty_printer *pp) const
+  {
+    pp_character (pp, '"');
+    for (size_t idx = 0; idx < m_len; idx++)
+      {
+	uchar byte = m_buf[idx];
+	if (byte < 0x80 && ISPRINT (byte))
+	  pp_character (pp, byte);
+	else
+	  {
+	    pp_string (pp, "\\x");
+	    for (int j = 1; j >= 0; j--)
+	      pp_character (pp,
+			    "0123456789abcdef"[(byte >> (4 * j)) & 0xF]);
+	  }
+      }
+    pp_character (pp, '"');
+  }
+
+  DEBUG_FUNCTION void dump () const
+  {
+    pretty_printer pp;
+    pp.buffer->stream = stderr;
+    dump_to_pp (&pp);
+    pp_newline (&pp);
+    pp_flush (&pp);
+  }
+
+  const uchar *get_buffer () const { return m_buf; }
+  size_t get_length () const { return m_len; }
+
+  void quick_append (uchar ch)
+  {
+    m_buf[m_len++] = ch;
+  }
+
+ private:
+  uchar  *m_buf;
+  size_t m_alloc_sz;
+  size_t m_len;
+};
+
+/* Append the canonical decomposition of CH to this string.
+   Note that this is recursive. For example,
+     1E14 has decomposition: 0112 0300
+   but
+     0112 (LATIN CAPITAL LETTER E WITH MACRON) has decomposition: 0045 0304
+   we need to recursively decompose from:
+     1e14 to 0112 0300 to (0045 0304) 0300
+
+   FIXME: should we unroll the recursion in the data, or do
+   the recursion here?
+   Here we're doing it recursively in code (is this necessary,
+   to allow for the Hangul cases, or could we unroll it in the data?)  */
+
+void
+utf32_string::append_decomposition (cppchar_t ch)
+{
+  /* Hangul has its own algorithmic rules for decomposition,
+     so first deal with these as a special case.
+
+     See "Hangul Syllable Decomposition" within section
+     "3.12 Conjoining Jamo Behavior" of the Unicode standard
+     (pp143-145 of version 14.0).
+
+     These variable names are taken directly from the reference
+     algorithm in the Unicode standard, hence we slightly diverge
+     from GNU variable naming standards for the sake of clarity.  */
+  {
+    const cppchar_t SBase = 0xAC00;
+    const cppchar_t LBase = 0x1100;
+    const cppchar_t VBase = 0x1161;
+    const cppchar_t TBase = 0x11A7;
+    const cppchar_t LCount = 19;
+    const cppchar_t VCount = 21;
+    const cppchar_t TCount = 28;
+    const cppchar_t NCount = 588;
+    STATIC_ASSERT (NCount == VCount * TCount);
+    const cppchar_t SCount = 11172;
+    STATIC_ASSERT (SCount == LCount * NCount);
+    if (ch >= SBase && ch < (SBase + SCount))
+      {
+	const cppchar_t SIndex = ch - SBase;
+	const cppchar_t L = LBase + SIndex / NCount;
+	const cppchar_t V = VBase + (SIndex % NCount) / TCount;
+	const cppchar_t T = TBase + SIndex % TCount;
+	append_decomposition (L);
+	append_decomposition (V);
+	if (T != TBase)
+	  append_decomposition (T);
+	return;
+      }
+  }
+
+  /* Otherwise, use the decomposition mappings from the
+     Unicode Character Database.  */
+
+  switch (ch)
+    {
+    default:
+      append (ch);
+      break;
+
+#define BEGIN_SRC(SRC_CODEPOINT)		\
+    case SRC_CODEPOINT:			\
+      {
+
+#define DST(CODEPOINT) \
+	do { append_decomposition (CODEPOINT); } while (0)
+
+#define END_SRC					\
+      }						\
+      break;
+
+#include "decomposition.inc"
+
+#undef BEGIN_SRC
+#undef DST
+#undef END_SRC
+    }
+}
+
+/* Return the string that results from converting this string to NFD.  */
+
+utf32_string
+utf32_string::convert_to_nfd () const
+{
+  /* Two-pass algorithm.
+     FIXME: Should we convert it to a single-pass algorithm?  */
+  utf32_string result (m_len);
+
+  /* First pass: replace each canonical composite with
+     its canonical decomposition.  */
+  for (size_t idx = 0; idx < m_len; idx++)
+    {
+      const cppchar_t ch = m_buf[idx];
+      result.append_decomposition (ch);
+    }
+
+  /* Second pass: sort sequences of combining marks by
+     combining class.  */
+  for (size_t start_idx = 0; start_idx < result.get_length (); )
+    {
+      if (cpp_combining_class (result[start_idx]) == 0)
+	start_idx++;
+      else
+	{
+	  /* Find run of followup chars that also have a nonzero
+	     combining class.
+	     Specifically, beyond_idx will be the first index after
+	     such a run.  */
+	  size_t beyond_idx;
+	  for (beyond_idx = start_idx + 1;
+	       (beyond_idx < result.get_length ()
+		&& cpp_combining_class (result[beyond_idx]) != 0);
+	       beyond_idx++)
+	    {
+	      /* Empty.  */
+	    }
+	  /* Sort this run of combining marks.  */
+	  if (beyond_idx > start_idx + 1)
+	    result.sort_by_combining_class (start_idx, beyond_idx);
+	  start_idx = beyond_idx;
+	}
+    }
+
+  return result;
+}
+
+/* Comparator callback for sorting cppchar_t by combining class.  */
+
+static int
+cmp_combining_class (const void *p1, const void *p2)
+{
+  cppchar_t c1 = *(const cppchar_t *)p1;
+  cppchar_t c2 = *(const cppchar_t *)p2;
+  return ((int)cpp_combining_class (c1)) - ((int)cpp_combining_class (c2));
+}
+
+/* Subroutine of utf32_string::convert_to_nfd: Sort the characters
+   in the half-open range [START_IDX, BEYOND_IDX) by combining class.  */
+
+void
+utf32_string::sort_by_combining_class (size_t start_idx, size_t beyond_idx)
+{
+  gcc_stablesort (m_buf + start_idx, beyond_idx - start_idx,
+		  sizeof (cppchar_t), cmp_combining_class);
+}
+
+/* Subroutine of get_tr39_skeleton, for implementing step 2
+   of skeleton(X) from
+   http://www.unicode.org/reports/tr39/#Confusable_Detection  */
+
+static utf32_string
+convert_homoglyphs_to_exemplars (const utf32_string &in_str)
+{
+  utf32_string result (in_str.get_length ());
+  for (size_t idx = 0; idx < in_str.get_length (); idx++)
+    {
+      const cppchar_t ch = in_str[idx];
+      switch (ch)
+	{
+	default:
+	  result.append (ch);
+	  break;
+
+#define BEGIN_SRC(SRC_CODEPOINT)		\
+	  case SRC_CODEPOINT:			\
+	    {
+
+#define DST(CODEPOINT) \
+	  do { result.append (CODEPOINT); } while (0)
+
+#define END_SRC					\
+	    }					\
+	    break;
+
+#include "confusables.inc"
+
+#undef BEGIN_SRC
+#undef DST
+#undef END_SRC
+	}
+    }
+  return result;
+}
+
+/* Implementation of skeleton(X) from
+   http://www.unicode.org/reports/tr39/#Confusable_Detection  */
+
+static utf32_string
+get_tr39_skeleton (const utf32_string &in_str)
+{
+  utf32_string result (in_str.get_length ());
+  result = in_str.convert_to_nfd ();
+  result = convert_homoglyphs_to_exemplars (result);
+  result = result.convert_to_nfd ();
+  return result;
+}
+
 /* Return an IDENTIFIER_NODE whose name is TEXT (a null-terminated string).
    If an identifier with that name has previously been referred to,
    the same node is returned this time.  */
@@ -131,6 +693,179 @@  maybe_get_identifier (const char *text)
   return NULL_TREE;
 }
 
+/* Information about the first time a skeleton is used.  */
+
+struct first_use
+{
+  first_use (tree identifier, location_t loc)
+  : m_identifier (identifier), m_loc (loc)
+  {}
+
+  /* The first identifier we saw that uses the skeleton.  */
+  tree m_identifier;
+
+  /* The location of the identifier.  */
+  location_t m_loc;
+};
+
+static hash_map<tree, tree> map_identifier_to_skeleton;
+static hash_map<tree, first_use> map_skeleton_to_first_use;
+
+/* A class for use when warning about homoglyphs: generate a copy of the
+   identifier, but escaping non-printable-ASCII bytes in the UTF-8
+   representation as \xNN.  */
+
+class escaped_identifier
+{
+ public:
+  escaped_identifier (tree id)
+  {
+    gcc_assert (TREE_CODE (id) == IDENTIFIER_NODE);
+    utf32_string utf32 (utf32_string::from_identifier (id));
+    for (size_t idx = 0; idx < utf32.get_length (); idx++)
+      print_escaped_codepoint (&m_pp, utf32[idx]);
+  }
+  const char *get_str ()
+  {
+    return pp_formatted_text (&m_pp);
+  }
+
+ private:
+  pretty_printer m_pp;
+};
+
+/* Called when an identifier is first created, and on every identifier
+   lookup.
+
+   Ensure that ID has a skeleton, as per
+     http://www.unicode.org/reports/tr39/#Confusable_Detection
+
+   The first time we see a new ID, generate the skeleton, and
+   if the skeleton is already in use by a different ID, issue
+   a -Whomoglyph diagnostic.  */
+
+static void
+maybe_warn_on_homoglyph (tree id)
+{
+  if (!warn_homoglyph)
+    return;
+
+  /* If we've already got the skeleton for ID, bail out.
+     This ensures that we only warn for the first occurrence
+     of the identifier.  */
+  /* FIXME: this is likely to be slow; is there somewhere we can stash
+     this in the identifier itself?  */
+  if (map_identifier_to_skeleton.get (id))
+    return;
+
+  utf32_string str
+    (utf32_string::from_utf8 ((const unsigned char *)IDENTIFIER_POINTER (id),
+			      IDENTIFIER_LENGTH (id)));
+  utf32_string skel = get_tr39_skeleton (str);
+  utf8_string utf8_skel (skel);
+
+  tree skel_id;
+
+  /* The common case is that SKELETON(ID) is ID.  */
+  if (utf8_skel.get_length () == IDENTIFIER_LENGTH (id)
+      && memcmp (IDENTIFIER_POINTER (id),
+		 utf8_skel.get_buffer (),
+		 utf8_skel.get_length ()) == 0)
+    skel_id = id;
+  else
+    {
+      /* Otherwise, get an identifier for the skeleton, but without
+	 checking for homoglyphs (avoiding recursion issues).  */
+      ident_hash->on_new_node = NULL;
+      ident_hash->on_existing_node = NULL;
+      skel_id = HT_IDENT_TO_GCC_IDENT (ht_lookup (ident_hash,
+						  utf8_skel.get_buffer (),
+						  utf8_skel.get_length (),
+						  HT_ALLOC));
+      ident_hash->on_new_node = stringpool_on_new_hashnode;
+      ident_hash->on_existing_node = stringpool_on_existing_hashnode;
+    }
+  map_identifier_to_skeleton.put (id, skel_id);
+  if (first_use *slot = map_skeleton_to_first_use.get (skel_id))
+    {
+      /* This skeleton has already been used by a different identifier;
+	 issue a diagnostic.
+
+	 If we simply print both identifiers, the resulting diagnostics
+	 are themselves confusing, as they are visually identical in
+	 the diagnostics.  Hence we also print escaped versions of the
+	 identifiers in the diagnostics for the cases where the identifier
+	 is non-equal to the skeleton (which could be one or both of them).  */
+
+      auto_diagnostic_group d;
+
+      /* CWE-1007: "Insufficient Visual Distinction of Homoglyphs
+	 Presented to User" */
+      diagnostic_metadata m;
+      m.add_cwe (1007);
+
+      gcc_rich_location richloc (input_location);
+      richloc.set_escape_on_output (true);
+      bool warned;
+      if (id == skel_id)
+	{
+	  warned = warning_meta (&richloc, m, OPT_Whomoglyph,
+				 "identifier %qs...",
+				 IDENTIFIER_POINTER (id));
+	}
+      else
+	{
+	  escaped_identifier escaped_id (id);
+	  warned = warning_meta (&richloc, m, OPT_Whomoglyph,
+				 "identifier %qs (%qs)...",
+				 IDENTIFIER_POINTER (id),
+				 escaped_id.get_str ());
+	}
+      if (warned)
+	{
+	  gcc_rich_location slot_richloc (slot->m_loc);
+	  slot_richloc.set_escape_on_output (true);
+	  if (slot->m_identifier == skel_id)
+	    {
+	      inform (&slot_richloc,
+		      "...confusable with non-equal identifier %qs here",
+		      IDENTIFIER_POINTER (slot->m_identifier));
+	    }
+	  else
+	    {
+	      escaped_identifier escaped_first_use (slot->m_identifier);
+	      inform (&slot_richloc,
+		      "...confusable with non-equal identifier %qs (%qs) here",
+		      IDENTIFIER_POINTER (slot->m_identifier),
+		      escaped_first_use.get_str ());
+	    }
+	}
+    }
+  else
+    {
+      /* Otherwise, this is the first use of this skeleton.  */
+      map_skeleton_to_first_use.put (skel_id, first_use (id, input_location));
+    }
+}
+
+/* Callback for handling insertions into the identifier hashtable.  */
+
+static void
+stringpool_on_new_hashnode (hashnode ht_node)
+{
+  tree t = HT_IDENT_TO_GCC_IDENT (ht_node);
+  maybe_warn_on_homoglyph (t);
+}
+
+/* Callback for handling reuse of identifiers within the
+   identifier hashtable.  */
+
+static void
+stringpool_on_existing_hashnode (hashnode ht_node)
+{
+  tree t = HT_IDENT_TO_GCC_IDENT (ht_node);
+  maybe_warn_on_homoglyph (t);
+}
+
 /* Report some basic statistics about the string pool.  */
 
 void
@@ -277,3 +1012,467 @@  gt_pch_restore_stringpool (void)
 }
 
 #include "gt-stringpool.h"
+
+#if CHECKING_P
+
+namespace selftest {
+
+/* Implementation detail of ASSERT_DUMP_EQ.  */
+
+static void
+assert_dump_eq (const location &loc,
+		const utf32_string &str,
+		const char *expected)
+{
+  pretty_printer pp;
+  str.dump_to_pp (&pp);
+  ASSERT_STREQ_AT (loc, pp_formatted_text (&pp), expected);
+}
+
+static void
+assert_dump_eq (const location &loc,
+		const utf8_string &str,
+		const char *expected)
+{
+  pretty_printer pp;
+  str.dump_to_pp (&pp);
+  ASSERT_STREQ_AT (loc, pp_formatted_text (&pp), expected);
+}
+
+/* Assert that STR.dump_to_pp () is EXPECTED.  */
+
+#define ASSERT_DUMP_EQ(STR, EXPECTED) \
+  SELFTEST_BEGIN_STMT							\
+  assert_dump_eq ((SELFTEST_LOCATION), (STR), (EXPECTED)); \
+  SELFTEST_END_STMT
+
+/* Verify that utf32_string::from_utf8 works as expected.
+   Also verify that utf32_string::dump_to_pp works.
+   Also verify that utf32 to utf8 works.  */
+
+static void
+test_utf32_from_utf8 (void)
+{
+  /* Empty str.  */
+  {
+    utf32_string s (utf32_string::from_utf8 (""));
+    ASSERT_EQ (s.get_length (), 0);
+    ASSERT_DUMP_EQ(s, "\"\"");
+
+    utf8_string s8 (s);
+    ASSERT_DUMP_EQ(s8, "\"\"");
+  }
+
+  /* Pure ASCII.  */
+  {
+    utf32_string s (utf32_string::from_utf8 ("hello world"));
+    ASSERT_EQ (s.get_length (), 11);
+    ASSERT_EQ (s[0], 'h');
+    ASSERT_EQ (s[1], 'e');
+    ASSERT_EQ (s[2], 'l');
+    ASSERT_EQ (s[3], 'l');
+    ASSERT_EQ (s[4], 'o');
+    ASSERT_EQ (s[5], ' ');
+    ASSERT_EQ (s[6], 'w');
+    ASSERT_EQ (s[7], 'o');
+    ASSERT_EQ (s[8], 'r');
+    ASSERT_EQ (s[9], 'l');
+    ASSERT_EQ (s[10], 'd');
+    ASSERT_DUMP_EQ(s, "\"hello world\"");
+
+    utf8_string s8 (s);
+    ASSERT_DUMP_EQ(s8, "\"hello world\"");
+  }
+
+  /* 2 bytes per char: a string embedding U+03C0 GREEK SMALL LETTER PI
+     which has UTF-8 encoding: 0xCF 0x80.  */
+  {
+    const char *utf8 = "Happy \xcf\x80 day!";
+    utf32_string s (utf32_string::from_utf8 (utf8));
+    ASSERT_EQ (s.get_length (), 12);
+    ASSERT_EQ (s[0], 'H');
+    ASSERT_EQ (s[1], 'a');
+    ASSERT_EQ (s[2], 'p');
+    ASSERT_EQ (s[3], 'p');
+    ASSERT_EQ (s[4], 'y');
+    ASSERT_EQ (s[5], ' ');
+    ASSERT_EQ (s[6], 0x3c0);
+    ASSERT_EQ (s[7], ' ');
+    ASSERT_EQ (s[8], 'd');
+    ASSERT_EQ (s[9], 'a');
+    ASSERT_EQ (s[10], 'y');
+    ASSERT_EQ (s[11], '!');
+    ASSERT_DUMP_EQ(s, "\"Happy \\u03c0 day!\"");
+
+    utf8_string s8 (s);
+    ASSERT_DUMP_EQ(s8, "\"Happy \\xcf\\x80 day!\"");
+  }
+
+  /* 3 bytes per char: the Japanese word "mojibake", written as the
+     4 codepoints:
+       U+6587 CJK UNIFIED IDEOGRAPH-6587
+       U+5B57 CJK UNIFIED IDEOGRAPH-5B57
+       U+5316 CJK UNIFIED IDEOGRAPH-5316
+       U+3051 HIRAGANA LETTER KE.  */
+  {
+    const char *utf8 = (/* U+6587 CJK UNIFIED IDEOGRAPH-6587
+			   UTF-8: 0xE6 0x96 0x87
+			   C octal escaped UTF-8: \346\226\207.  */
+			"\346\226\207"
+
+			/* U+5B57 CJK UNIFIED IDEOGRAPH-5B57
+			   UTF-8: 0xE5 0xAD 0x97
+			   C octal escaped UTF-8: \345\255\227.  */
+			"\345\255\227"
+
+			/* U+5316 CJK UNIFIED IDEOGRAPH-5316
+			   UTF-8: 0xE5 0x8C 0x96
+			   C octal escaped UTF-8: \345\214\226.  */
+			 "\345\214\226"
+
+			 /* U+3051 HIRAGANA LETTER KE
+			      UTF-8: 0xE3 0x81 0x91
+			      C octal escaped UTF-8: \343\201\221.  */
+			"\343\201\221");
+    utf32_string s (utf32_string::from_utf8 (utf8));
+    ASSERT_EQ (s.get_length (), 4);
+    ASSERT_EQ (s[0], 0x6587);
+    ASSERT_EQ (s[1], 0x5b57);
+    ASSERT_EQ (s[2], 0x5316);
+    ASSERT_EQ (s[3], 0x3051);
+    ASSERT_DUMP_EQ(s, "\"\\u6587\\u5b57\\u5316\\u3051\"");
+
+    utf8_string s8 (s);
+    ASSERT_DUMP_EQ(s8, ("\""
+			"\\xe6\\x96\\x87"
+			"\\xe5\\xad\\x97"
+			"\\xe5\\x8c\\x96"
+			"\\xe3\\x81\\x91"
+			"\""));
+  }
+
+  /* 4 bytes per char: the emoji U+1F602 FACE WITH TEARS OF JOY,
+     which has UTF-8 encoding "\xf0\x9f\x98\x82".  */
+  {
+    const char *utf8 = "pre \xf0\x9f\x98\x82 post";
+    utf32_string s (utf32_string::from_utf8 (utf8));
+    ASSERT_EQ (s.get_length (), 10);
+    ASSERT_EQ (s[0], 'p');
+    ASSERT_EQ (s[1], 'r');
+    ASSERT_EQ (s[2], 'e');
+    ASSERT_EQ (s[3], ' ');
+    ASSERT_EQ (s[4], 0x1f602);
+    ASSERT_EQ (s[5], ' ');
+    ASSERT_EQ (s[6], 'p');
+    ASSERT_EQ (s[7], 'o');
+    ASSERT_EQ (s[8], 's');
+    ASSERT_EQ (s[9], 't');
+    ASSERT_DUMP_EQ(s, "\"pre \\U0001f602 post\"");
+
+    utf8_string s8 (s);
+    ASSERT_DUMP_EQ(s8, ("\"pre \\xf0\\x9f\\x98\\x82 post\""));
+  }
+}
+
+/* Verify that utf32_string::from_hex works as expected.  */
+
+static void
+test_from_hex ()
+{
+  const char *buf = "11935 0334 11930";
+  utf32_string result (utf32_string::from_hex (buf, strlen (buf)));
+  ASSERT_EQ (result.get_length (), 3);
+  ASSERT_DUMP_EQ (result, "\"\\U00011935\\u0334\\U00011930\"");
+}
+
+/* Verify that cpp_combining_class works as expected.  */
+
+static void
+test_combining_classes ()
+{
+  /* LATIN CAPITAL LETTER A.  */
+  ASSERT_EQ (cpp_combining_class (0x41), 0);
+
+  /* COMBINING ACUTE ACCENT.  */
+  ASSERT_EQ (cpp_combining_class (0x301), 230);
+
+  /* COMBINING CEDILLA.  */
+  ASSERT_EQ (cpp_combining_class (0x327), 202);
+}
+
+/* Implementation detail of ASSERT_UTF32_EQ_AT.  */
+
+static void
+assert_utf32_eq_at (const location &code_loc,
+		    const location &data_loc,
+		    const utf32_string &str1,
+		    const utf32_string &str2,
+		    const char *desc1,
+		    const char *desc2)
+{
+  if (str1 == str2)
+    return;
+  ::selftest::note_formatted (data_loc,
+			      "ASSERT_UTF32_EQ_AT: on this data line");
+  fprintf (stderr, "str1: ");
+  str1.dump ();
+  fprintf (stderr, "str2: ");
+  str2.dump ();
+  ::selftest::fail_formatted (code_loc,
+			      "ASSERT_UTF32_EQ_AT: str1: %s, str2: %s",
+			      desc1, desc2);
+}
+
+/* Assert that STR1 equals STR2, showing both SELFTEST_LOCATION, DATA_LOC,
+   and a dump of the strings on failure.  */
+
+#define ASSERT_UTF32_EQ_AT(DATA_LOC, STR1, STR2)		\
+  SELFTEST_BEGIN_STMT							\
+  assert_utf32_eq_at ((SELFTEST_LOCATION), (DATA_LOC),			\
+		      (STR1), (STR2), (#STR1), (#STR2));		\
+  SELFTEST_END_STMT
+
+/* Handle one line from
+     https://www.unicode.org/Public/UNIDATA/NormalizationTest.txt
+   where PATH is the path to the file,
+   LINE_NUM is the 1-based line number within the file.
+
+   Verify that utf32_string::convert_to_nfd works as expected.  */
+
+static void
+test_normalization_line (const char *path, unsigned line_num,
+			 const char *line_start, size_t line_len)
+{
+  /* Update LINE_LEN to ignore the first hash character,
+     and anything after it.  */
+  if (const char *hash = (const char *)memchr (line_start, '#', line_len))
+    line_len = hash - line_start;
+
+  if (line_len == 0)
+    return;
+
+  /* Ignore the "@PartN" lines.  */
+  if (line_start[0] == '@')
+    return;
+
+  location data_loc (path, line_num, __FUNCTION__);
+
+  /* Locate columns divided by semicolon characters.  */
+  auto_vec <std::pair <const char *, size_t>> columns (5);
+  while (const char *sep = (const char *)memchr (line_start, ';', line_len))
+    {
+      size_t column_width = sep - line_start;
+      columns.safe_push (std::pair <const char *, size_t> (line_start,
+							   column_width));
+      line_start = sep + 1;
+      line_len -= column_width + 1;
+    }
+
+  /* We expect each line to have 5 columns.  */
+  ASSERT_EQ (columns.length (), 5);
+
+  utf32_string col1 (utf32_string::from_hex (columns[0].first,
+					     columns[0].second));
+  utf32_string col2 (utf32_string::from_hex (columns[1].first,
+					     columns[1].second));
+  utf32_string col3 (utf32_string::from_hex (columns[2].first,
+					     columns[2].second));
+  utf32_string col4 (utf32_string::from_hex (columns[3].first,
+					     columns[3].second));
+  utf32_string col5 (utf32_string::from_hex (columns[4].first,
+					     columns[4].second));
+
+  /* Verify NFD.  */
+  /* "c3 ==  toNFD(c1) ==  toNFD(c2) ==  toNFD(c3)". */
+  ASSERT_UTF32_EQ_AT (data_loc, col3, col1.convert_to_nfd ());
+  ASSERT_UTF32_EQ_AT (data_loc, col3, col2.convert_to_nfd ());
+  ASSERT_UTF32_EQ_AT (data_loc, col3, col3.convert_to_nfd ());
+  /* "c5 ==  toNFD(c4) ==  toNFD(c5)". */
+  ASSERT_UTF32_EQ_AT (data_loc, col5, col4.convert_to_nfd ());
+  ASSERT_UTF32_EQ_AT (data_loc, col5, col5.convert_to_nfd ());
+}
+
+/* Call test_normalization_line on each line of a copy of the
+   https://www.unicode.org/Public/UNIDATA/NormalizationTest.txt
+   conformance test.  */
+
+static void
+test_normalization ()
+{
+  char *path = locate_file ("NormalizationTest.txt");
+  char *data = read_file (SELFTEST_LOCATION, path);
+
+  const char *line_start = data;
+  unsigned line_num = 1;
+  while (1)
+    {
+      const char *line_end = strchr (line_start, '\n');
+      if (line_end == NULL)
+	break;
+      test_normalization_line (path, line_num,
+			       line_start, line_end - line_start);
+      line_start = line_end + 1;
+      line_num++;
+    }
+
+  free (data);
+  free (path);
+}
+
+/* Verify that get_tr39_skeleton works as expected.  */
+
+static void
+test_tr39_skeleton_1 ()
+{
+  utf32_string s1 (utf32_string::from_utf8 ("sayHello"));
+  ASSERT_EQ (s1.get_length (), 8);
+  ASSERT_EQ (s1[3], 'H');
+  ASSERT_DUMP_EQ(s1, "\"sayHello\"");
+  ASSERT_EQ (s1, s1);
+
+  utf32_string skel1 = get_tr39_skeleton (s1);
+
+  /* s1 should map to itself.  */
+  ASSERT_EQ (s1, skel1);
+
+  /* "CYRILLIC CAPITAL LETTER EN" (U+041D), with UTF-8 encoding 0xD0 0x9D,
+     as opposed to ASCII 'H' (U+0048).  */
+  utf32_string s2 (utf32_string::from_utf8 ("say" "\xd0\x9d" "ello"));
+  ASSERT_EQ (s2.get_length (), 8);
+  ASSERT_EQ (s2[3], 0x41d);
+  ASSERT_DUMP_EQ(s2, "\"say\\u041dello\"");
+  ASSERT_EQ (s2, s2);
+
+  ASSERT_NE (s1, s2);
+
+  utf32_string skel2 = get_tr39_skeleton (s2);
+  ASSERT_DUMP_EQ(skel2, "\"sayHello\"");
+  ASSERT_NE (s2, skel2);
+
+  /* Both should map to the same skeleton.  */
+  ASSERT_EQ (skel1, skel2);
+}
+
+/* Example of a single codepoint mapping to multiple codepoints.  */
+
+static void
+test_tr39_skeleton_2 ()
+{
+  utf32_string s1 (utf32_string::from_utf8 ("(v)"));
+  ASSERT_EQ (s1.get_length (), 3);
+  ASSERT_DUMP_EQ(s1, "\"(v)\"");
+
+  utf32_string skel1 = get_tr39_skeleton (s1);
+
+  /* s1 should map to itself.  */
+  ASSERT_EQ (s1, skel1);
+
+  /* "PARENTHESIZED LATIN SMALL LETTER V" (U+24B1) should map to
+     LEFT PARENTHESIS, LATIN SMALL LETTER V, RIGHT PARENTHESIS.  */
+  utf32_string s2 (utf32_string::from_cppchar_t (0x24b1));
+  ASSERT_DUMP_EQ(s2, "\"\\u24b1\"");
+  utf32_string skel2 = get_tr39_skeleton (s2);
+  ASSERT_DUMP_EQ(skel2, "\"(v)\"");
+
+  /* Both should map to the same skeleton.  */
+  ASSERT_EQ (skel1, skel2);
+}
+
+/* Both LATIN CAPITAL LETTER H WITH DESCENDER (U+2C67)
+   and CYRILLIC CAPITAL LETTER EN WITH DESCENDER (U+04A2)
+   should map to LATIN CAPITAL LETTER H, COMBINING VERTICAL LINE BELOW
+   (U+0048, U+0329).  */
+
+static void
+test_tr39_skeleton_3 ()
+{
+  utf32_string s1 (utf32_string::from_cppchar_t (0x2c67));
+  ASSERT_DUMP_EQ(s1, "\"\\u2c67\"");
+
+  utf32_string s2 (utf32_string::from_cppchar_t (0x04a2));
+  ASSERT_DUMP_EQ(s2, "\"\\u04a2\"");
+
+  utf32_string skel1 = get_tr39_skeleton (s1);
+  ASSERT_NE (s1, skel1);
+
+  utf32_string skel2 = get_tr39_skeleton (s2);
+  ASSERT_NE (s2, skel2);
+
+  /* Both should map to the same skeleton.  */
+  ASSERT_EQ (skel1, skel2);
+  ASSERT_DUMP_EQ(skel2, "\"H\\u0329\"");
+}
+
+static void
+test_tr39_skeleton_4 ()
+{
+  /* U+0124: LATIN CAPITAL LETTER H WITH CIRCUMFLEX
+     has canonical decomposition 0048 0302.
+     This verifies that we apply NFD to the skeleton.  */
+  utf32_string s1 (utf32_string::from_cppchar_t (0x0124));
+  ASSERT_DUMP_EQ(s1, "\"\\u0124\"");
+
+  /* "CYRILLIC CAPITAL LETTER EN" (U+041D) as opposed to ASCII 'H' (U+0048),
+     followed by COMBINING CIRCUMFLEX ACCENT (U+0302).
+     This is not in NFC form.  */
+  utf32_string s2 (2);
+  s2.append (0x041d);
+  s2.append (0x0302);
+  ASSERT_DUMP_EQ(s2, "\"\\u041d\\u0302\"");
+
+  utf32_string skel1 = get_tr39_skeleton (s1);
+  ASSERT_NE (s1, skel1);
+
+  utf32_string skel2 = get_tr39_skeleton (s2);
+  ASSERT_NE (s2, skel2);
+
+  /* Both should map to the same skeleton.  */
+  ASSERT_EQ (skel1, skel2);
+  ASSERT_DUMP_EQ(skel2, "\"H\\u0302\"");
+}
+
+static void
+test_tr39_skeleton_5 ()
+{
+  /* U+01C6: LATIN SMALL LETTER DZ WITH CARON
+     (has compatibility decomposition 0064 017E).  */
+  utf32_string s1 (utf32_string::from_cppchar_t (0x01c6));
+  ASSERT_DUMP_EQ(s1, "\"\\u01c6\"");
+
+  /* "LATIN SMALL LETTER D, LATIN SMALL LETTER Z, COMBINING CARON".
+     This is not in NFC form.  */
+  utf32_string s2 (3);
+  s2.append (0x0064); /* LATIN SMALL LETTER D. */
+  s2.append (0x007a); /* LATIN SMALL LETTER Z. */
+  s2.append (0x030c); /* COMBINING CARON.  */
+  ASSERT_DUMP_EQ(s2, "\"dz\\u030c\"");
+
+  utf32_string skel1 = get_tr39_skeleton (s1);
+  ASSERT_NE (s1, skel1);
+  ASSERT_DUMP_EQ(skel1, "\"dz\\u030c\"");
+
+  utf32_string skel2 = get_tr39_skeleton (s2);
+  ASSERT_NE (s2, skel2);
+  ASSERT_DUMP_EQ(skel2, "\"dz\\u0306\"");
+  ASSERT_NE (skel1, skel2);
+}
+
+/* Run all of the selftests within this file.  */
+
+void
+stringpool_c_tests ()
+{
+  test_utf32_from_utf8 ();
+  test_from_hex ();
+  test_combining_classes ();
+  test_normalization ();
+  test_tr39_skeleton_1 ();
+  test_tr39_skeleton_2 ();
+  test_tr39_skeleton_3 ();
+  test_tr39_skeleton_4 ();
+  test_tr39_skeleton_5 ();
+}
+
+} // namespace selftest
+
+#endif /* CHECKING_P */
diff --git a/gcc/testsuite/c-c++-common/Whomoglyph-1.c b/gcc/testsuite/c-c++-common/Whomoglyph-1.c
new file mode 100644
index 00000000000..9fc9d680bba
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/Whomoglyph-1.c
@@ -0,0 +1,18 @@ 
+#include <stdio.h>
+
+/* ASCII 'H' (U+0048).  */
+
+void sayHello() { /* { dg-message "\\.\\.\\.confusable with non-equal identifier 'sayHello' here" } */
+  printf("Hello, World!\n");
+}
+
+/* CYRILLIC CAPITAL LETTER EN (U+041D), with UTF-8 encoding 0xD0 0x9D.  */
+
+void sayНello() { /* { dg-warning "identifier 'sayНello' \\('say\\\\u041dello'\\)\\.\\.\\." } */
+  printf("Goodbye, World!\n");
+}
+
+int main() {
+  sayНello();
+  return 0;
+}
diff --git a/gcc/testsuite/c-c++-common/Whomoglyph-2.c b/gcc/testsuite/c-c++-common/Whomoglyph-2.c
new file mode 100644
index 00000000000..0105cd19158
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/Whomoglyph-2.c
@@ -0,0 +1,18 @@ 
+#include <stdio.h>
+
+/* CYRILLIC CAPITAL LETTER EN (U+041D), with UTF-8 encoding 0xD0 0x9D.  */
+
+void sayНello() { /* { dg-message "\\.\\.\\.confusable with non-equal identifier 'sayНello' \\('say\\\\u041dello'\\) here" } */
+  printf("Hello, World!\n");
+}
+
+/* ASCII 'H' (U+0048).  */
+
+void sayHello() { /* { dg-warning "identifier 'sayHello'\\.\\.\\." } */
+  printf("Goodbye, World!\n");
+}
+
+int main() {
+  sayНello();
+  return 0;
+}
diff --git a/gcc/testsuite/c-c++-common/Whomoglyph-3.c b/gcc/testsuite/c-c++-common/Whomoglyph-3.c
new file mode 100644
index 00000000000..4f2c773f0eb
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/Whomoglyph-3.c
@@ -0,0 +1,20 @@ 
+/* Two confusables that both map to a third (LATIN CAPITAL LETTER H).  */
+
+#include <stdio.h>
+
+/* CYRILLIC CAPITAL LETTER EN (U+041D), with UTF-8 encoding 0xD0 0x9D.  */
+
+void sayНello() { /* { dg-message "\\.\\.\\.confusable with non-equal identifier 'sayНello' \\('say\\\\u041dello'\\) here" } */
+  printf("Hello, World!\n");
+}
+
+/* GREEK CAPITAL LETTER ETA (U+0397).  */
+
+void sayΗello() { /* { dg-warning "identifier 'sayΗello' \\('say\\\\u0397ello'\\)\\.\\.\\." } */
+  printf("Goodbye, World!\n");
+}
+
+int main() {
+  sayНello();
+  return 0;
+}
diff --git a/libcpp/charset.c b/libcpp/charset.c
index 0b0ccc6c021..894c43d102d 100644
--- a/libcpp/charset.c
+++ b/libcpp/charset.c
@@ -921,6 +921,29 @@  struct ucnrange {
 /* ISO 10646 defines the UCS codespace as the range 0-0x10FFFF inclusive.  */
 #define UCS_LIMIT 0x10FFFF
 
+/* Get the canonical combining class of C, or 0 if C is outside the
+   UCS codespace.  */
+
+unsigned char
+cpp_combining_class (cppchar_t c)
+{
+  int mn, mx, md;
+
+  if (c > UCS_LIMIT)
+    return 0;
+
+  mn = 0;
+  mx = ARRAY_SIZE (ucnranges) - 1;
+  while (mx != mn)
+    {
+      md = (mn + mx) / 2;
+      if (c <= ucnranges[md].end)
+	mx = md;
+      else
+	mn = md + 1;
+    }
+  return ucnranges[mn].combine;
+}
+
 /* Returns 1 if C is valid in an identifier, 2 if C is valid except at
    the start of an identifier, and 0 if C is not valid in an
    identifier.  We assume C has already gone through the checks of
diff --git a/libcpp/include/cpplib.h b/libcpp/include/cpplib.h
index 176f8c5bbce..0f31115c2d7 100644
--- a/libcpp/include/cpplib.h
+++ b/libcpp/include/cpplib.h
@@ -1267,6 +1267,9 @@  extern bool cpp_pedwarning (cpp_reader *, enum cpp_warning_reason,
 extern bool cpp_warning_syshdr (cpp_reader *, enum cpp_warning_reason reason,
 				const char *msgid, ...)
   ATTRIBUTE_PRINTF_3;
+extern bool cpp_warning_at (cpp_reader *, enum cpp_warning_reason,
+			    rich_location *richloc, const char *msgid, ...)
+  ATTRIBUTE_PRINTF_4;
 
 /* As their counterparts above, but use RICHLOC.  */
 extern bool cpp_warning_at (cpp_reader *, enum cpp_warning_reason,
@@ -1544,4 +1547,6 @@  int cpp_wcwidth (cppchar_t c);
 bool cpp_input_conversion_is_trivial (const char *input_charset);
 int cpp_check_utf8_bom (const char *data, size_t data_length);
 
+unsigned char cpp_combining_class (cppchar_t ch);
+
 #endif /* ! LIBCPP_CPPLIB_H */
diff --git a/libcpp/include/symtab.h b/libcpp/include/symtab.h
index 6905753f839..c5be0a3474a 100644
--- a/libcpp/include/symtab.h
+++ b/libcpp/include/symtab.h
@@ -54,6 +54,10 @@  struct ht
   /* Call back, allocate something that hangs off a node like a cpp_macro.  
      NULL means use the usual allocator.  */
   void * (*alloc_subobject) (size_t);
+  /* Call back, a new node has been allocated.  */
+  void (*on_new_node) (hashnode);
+  /* Call back, a lookup has found an existing node.  */
+  void (*on_existing_node) (hashnode);
 
   unsigned int nslots;		/* Total slots in the entries array.  */
   unsigned int nelements;	/* Number of live elements.  */
diff --git a/libcpp/symtab.c b/libcpp/symtab.c
index 9a2fae0f78d..b33fed3ef1e 100644
--- a/libcpp/symtab.c
+++ b/libcpp/symtab.c
@@ -119,7 +119,11 @@  ht_lookup_with_hash (cpp_hash_table *table, const unsigned char *str,
       else if (node->hash_value == hash
 	       && HT_LEN (node) == (unsigned int) len
 	       && !memcmp (HT_STR (node), str, len))
-	return node;
+	{
+	  if (table->on_existing_node)
+	    (*table->on_existing_node) (node);
+	  return node;
+	}
 
       /* hash2 must be odd, so we're guaranteed to visit every possible
 	 location in the table during rehashing.  */
@@ -141,7 +145,11 @@  ht_lookup_with_hash (cpp_hash_table *table, const unsigned char *str,
 	  else if (node->hash_value == hash
 		   && HT_LEN (node) == (unsigned int) len
 		   && !memcmp (HT_STR (node), str, len))
-	    return node;
+	    {
+	      if (table->on_existing_node)
+		(*table->on_existing_node) (node);
+	      return node;
+	    }
 	}
     }
 
@@ -173,6 +181,9 @@  ht_lookup_with_hash (cpp_hash_table *table, const unsigned char *str,
     /* Must expand the string table.  */
     ht_expand (table);
 
+  if (table->on_new_node)
+    (*table->on_new_node) (node);
+
   return node;
 }