From patchwork Thu Oct 24 16:39:51 2019
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "Simon Marchi (Code Review)"
 <gerrit@gnutoolchain-gerrit.osci.io>
X-Patchwork-Id: 35280
Received: (qmail 110916 invoked by alias); 24 Oct 2019 16:39:58 -0000
Mailing-List: contact gdb-patches-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Id: <gdb-patches.sourceware.org>
List-Unsubscribe: <mailto:gdb-patches-unsubscribe-##L=##H@sourceware.org>
List-Subscribe: <mailto:gdb-patches-subscribe@sourceware.org>
List-Archive: <http://sourceware.org/ml/gdb-patches/>
List-Post: <mailto:gdb-patches@sourceware.org>
List-Help: <mailto:gdb-patches-help@sourceware.org>,
	<http://sourceware.org/ml/#faqs>
Sender: gdb-patches-owner@sourceware.org
Delivered-To: mailing list gdb-patches@sourceware.org
Received: (qmail 110908 invoked by uid 89); 24 Oct 2019 16:39:57 -0000
Authentication-Results: sourceware.org; auth=none
X-Spam-SWARE-Status: No, score=-20.6 required=5.0 tests=AWL, BAYES_00,
	GIT_PATCH_0, GIT_PATCH_1, GIT_PATCH_2, GIT_PATCH_3,
	KAM_SHORT autolearn=ham version=3.3.1 spammy=*c, *.c, intends,
	dictionary
X-HELO: mx1.osci.io
Received: from polly.osci.io (HELO mx1.osci.io) (8.43.85.229) by
	sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP;
	Thu, 24 Oct 2019 16:39:56 +0000
Received: by mx1.osci.io (Postfix, from userid 994)	id BB43A204A7;
	Thu, 24 Oct 2019 12:39:54 -0400 (EDT)
Received: from gnutoolchain-gerrit.osci.io (gnutoolchain-gerrit.osci.io
	[8.43.85.239])	by mx1.osci.io (Postfix) with ESMTP id
	5E568202DF	for <gdb-patches@sourceware.org>;
	Thu, 24 Oct 2019 12:39:51 -0400 (EDT)
Received: from localhost (localhost [127.0.0.1])	by
	gnutoolchain-gerrit.osci.io (Postfix) with ESMTP id
	362A3204C9	for <gdb-patches@sourceware.org>;
	Thu, 24 Oct 2019 12:39:51 -0400 (EDT)
X-Gerrit-PatchSet: 1
Date: Thu, 24 Oct 2019 12:39:51 -0400
From: "Tom de Vries (Code Review)" <gerrit@gnutoolchain-gerrit.osci.io>
To: gdb-patches@sourceware.org
Message-ID: 
 <gerrit.1571935190000.I7b119c9a4519cdbf62a3243d1df2927c80813e8b@gnutoolchain-gerrit.osci.io>
Auto-Submitted: auto-generated
X-Gerrit-MessageType: newchange
Subject: [review] [RFC][gdb/contrib] Add words.sh script
X-Gerrit-Change-Id: I7b119c9a4519cdbf62a3243d1df2927c80813e8b
X-Gerrit-Change-Number: 282
X-Gerrit-ChangeURL: 
 <https://gnutoolchain-gerrit.osci.io/r/c/binutils-gdb/+/282>
X-Gerrit-Commit: d24c519ae276e163daf80d601cdb3c329e225c36
References: 
 <gerrit.1571935190000.I7b119c9a4519cdbf62a3243d1df2927c80813e8b@gnutoolchain-gerrit.osci.io>
Reply-To: tdevries@suse.de, gdb-patches@sourceware.org
MIME-Version: 1.0
Content-Disposition: inline
User-Agent: Gerrit/3.0.3

Change URL: https://gnutoolchain-gerrit.osci.io/r/c/binutils-gdb/+/282
......................................................................

[RFC][gdb/contrib] Add words.sh script

Add a script that takes a list of files as arguments and outputs a list of
the words found in their C comments, each prefixed with its frequency.

For:
...
$ ./gdb/contrib/words.sh $(find gdb -type f \( -name "*.c" -o -name "*.h" \))
...
it generates a list of ~15000 words prefixed with frequency.

This could be used to generate a dictionary that is kept as part of the
sources and against which new code is checked, producing a warning or an
error for unknown words.  The hope is that the check would fire often on
misspellings and only rarely on valid but uncommon words; otherwise the
burden of updating the dictionary would be too much.
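
As a rough illustration of such a check (not part of this patch), the sketch
below reads "COUNT WORD" lines, the format produced by uniq -c, and prints the
words that are missing from a dictionary.  The function name check_words and
the one-lowercase-word-per-line dictionary format are assumptions made for
the example only.

```shell
#!/bin/sh
# Hypothetical helper, not part of words.sh.
#
# check_words DICT
#   Read "COUNT WORD" lines on stdin and print every WORD that is not
#   listed in DICT (assumed format: one lowercase word per line).
check_words () {
    # Keep the word column, skipping lines that have a count but no word.
    awk 'NF >= 2 { print $2 }' \
        | sort -u \
        | grep -v -x -F -f "$1"
}

# Possible use (the dictionary path is made up for the example):
#   ./gdb/contrib/words.sh FILE... | check_words gdb/contrib/dictionary.txt
```

grep exits with status 1 when every word is known, so a caller could turn
"output is non-empty" into the warning or error mentioned above.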

And for:
...
$ ./gdb/contrib/words.sh -f 1 $(find gdb -type f \( -name "*.c" -o -name "*.h" \))
...
it generates a list of ~5000 words with frequency 1.

This can be used to scan for misspellings manually.
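
To go from a suspicious word back to its source location, something like the
following sketch might help.  It is not part of this patch: locate_words is a
hypothetical name, grep -w is a GNU/BSD extension, and the search covers
whole files, so matches outside comments are reported too.

```shell
#!/bin/sh
# Hypothetical helper, not part of words.sh.
#
# locate_words FILE...
#   Read "COUNT WORD" lines on stdin and print FILE:LINE:TEXT for each
#   case-insensitive whole-word occurrence of WORD in the given files.
locate_words () {
    while read -r count word; do
        # Skip lines that carry a count but no word.
        [ -n "$word" ] || continue
        # The extra /dev/null operand makes grep print the file name
        # even when only one file is given.
        grep -n -i -w -F -e "$word" -- "$@" /dev/null
    done
}

# Possible use:
#   ./gdb/contrib/words.sh -f 1 FILE... | locate_words FILE...
```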

Change-Id: I7b119c9a4519cdbf62a3243d1df2927c80813e8b
---
A gdb/contrib/words.sh
1 file changed, 107 insertions(+), 0 deletions(-)

diff --git a/gdb/contrib/words.sh b/gdb/contrib/words.sh
new file mode 100755
index 0000000..ad6ec2b
--- /dev/null
+++ b/gdb/contrib/words.sh
@@ -0,0 +1,107 @@
+#!/bin/sh
+
+# Copyright (C) 2019 Free Software Foundation, Inc.
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 3 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+
+# This script intends to facilitate spell checking of comments in C sources.
+# It:
+# - extracts comments from C files
+# - transforms the comments into a list of lowercase words
+# - prefixes each word with the frequency
+# - keeps only the words whose frequency lies within a given range
+# - sorts the words, longest first
+
+dir=$(cd "$(dirname "$0")" && pwd -P)
+
+minfreq=
+maxfreq=
+while [ $# -gt 0 ]; do
+    case "$1" in
+	--freq|-f)
+	    minfreq=$2
+	    maxfreq=$2
+	    shift 2
+	    ;;
+	--min)
+	    minfreq=$2
+	    if [ "$maxfreq" = "" ]; then
+		maxfreq=0
+	    fi
+	    shift 2
+	    ;;
+	--max)
+	    maxfreq=$2
+	    if [ "$minfreq" = "" ]; then
+		minfreq=0
+	    fi
+	    shift 2
+	    ;;
+	*)
+	    break;
+	    ;;
+    esac
+done
+
+if [ "$minfreq" = "" ] && [ "$maxfreq" = "" ]; then
+    minfreq=0
+    maxfreq=0
+fi
+
+awkfile=$(mktemp)
+
+cat > "$awkfile" <<EOF
+BEGIN {
+    in_comment=0
+}
+
+// {
+    line=\$0
+}
+
+/\/\*/ {
+    in_comment=1
+    sub(/.*\/\*/, "", line)
+}
+
+/\*\// {
+    sub(/\*\/.*/, "", line)
+    in_comment=0
+    print line
+    next
+}
+
+// {
+    if (in_comment) {
+	print line
+    }
+}
+EOF
+
+awk \
+    -f "$awkfile" \
+    "$@" \
+    | sed 's/[%^$~#{}`&=@,. \t\/_()|<>\+\*-]/\n/g' \
+    | sed 's/\[/\n/g' \
+    | sed 's/\]/\n/g' \
+    | sed 's/[0-9][0-9]*/\n/g' \
+    | tr '[:upper:]' '[:lower:]' \
+    | sed -e 's/[ \t]*//g' -e '/^$/d' \
+    | sort \
+    | uniq -c \
+    | awk -v min="$minfreq" -v max="$maxfreq" '(min == 0 || min <= $1) && (max == 0 || $1 <= max)' \
+    | awk '{ print length($0) " " $0; }' \
+    | sort -n -r \
+    | cut -d ' ' -f 2-
+
+rm -f "$awkfile"