wwwdocs: experiments with a Python postprocessing script
Commit Message
The heading elements in our website contain "id" information,
but currently to find them you have to look at the page source,
whereas in the generated HTML for the manual we have e.g.:
<a class="copiable-link" href="#index-mabi-1"> ¶</a>
which shows up nicely in the browser in e.g.
https://gcc.gnu.org/onlinedocs/gcc/ARM-Options.html
as a pilcrow character when you hover over the link, which
you can then use to copy the URL to the clipboard.
It's *very* helpful to have easily shareable links to within pages.
The attached patch adds a postprocessing step to "bin" that
turns e.g.
<h1 id="ID">TEXT</h1>
to:
<h1 id="ID"><a href="#ID">TEXT</a></h1>
which makes it very easy to copy links in the generated website.
I didn't bother adding any CSS.
I've never managed to build MetaHTML and have always just
crossed my fingers and hoped when making edits to the GCC
website; bin/preprocess just errors out for me immediately
due to not finding mhc.
So this patch as written replaces the invocation of mhc with
an invocation of the python script, which of course drops
various features.
I've uploaded a build of the website with this to:
https://dmalcolm.fedorapeople.org/gcc/2025-01-15/htdocs/
You can see e.g. the easily clickable heading ids here:
https://dmalcolm.fedorapeople.org/gcc/2025-01-15/htdocs/gcc-15/changes.html
compared to:
https://gcc.gnu.org/gcc-15/porting_to.html
and, for now, the loss of the mhc stuff here:
https://dmalcolm.fedorapeople.org/gcc/2025-01-15/htdocs/
compared to:
https://gcc.gnu.org/
Gerald: if you have mhc working, can you please try adjusting the
bin/ so it runs *both* mhc and the Python script?
Thoughts?
Dave
---
bin/preprocess | 13 +++----------
bin/process_html.py | 32 ++++++++++++++++++++++++++++++++
2 files changed, 35 insertions(+), 10 deletions(-)
create mode 100644 bin/process_html.py
Comments
On Wed, 15 Jan 2025, David Malcolm wrote:
> I've never managed to build MetaHTML and have always just
> crossed my fingers and hoped when making edits to the GCC
> website; bin/preprocess just errors out for me immediately
> due to not finding mhc.
I think that replacing MetaHTML with a Python script would make sense
(though I haven't reviewed the script that you posted in 2018 or
investigated what updates it might need now).
On Wed, 15 Jan 2025, David Malcolm wrote:
> The heading elements in our website contain "id" information,
> but currently to find them you have to look at the page source,
> whereas in the generated HTML for the manual we have e.g.:
>
> <a class="copiable-link" href="#index-mabi-1"> ¶</a>
>
> which shows up nicely in the browser in e.g.
> https://gcc.gnu.org/onlinedocs/gcc/ARM-Options.html
:
> It's *very* helpful to have easily shareable links to within pages.
Absolutely agreed.
> I've never managed to build MetaHTML and have always just crossed my
> fingers and hoped when making edits to the GCC website; bin/preprocess
> just errors out for me immediately due to not finding mhc.
Yes, sadly the GNU project let MetaHTML die (though I raised this more
than once). I still think the concept as such was fine and it served us
well over the years, but building it was already challenging 20 years ago and
would require some fierce source code editing nowadays. :-(
> So this patch as written replaces the invocation of mhc with an
> invocation of the python script, which of course drops various features.
Yeah! I was hoping we could return to your script. IIRC I once pinged and
you were busy; happy to collaborate on finishing this up.
In any case, a few years ago I spent quite some time and effort to prepare
the stage, migrating the site to CSS (done) and making individual pages
self contained (also done), which removed most of the original MetaHTML
usage.
This is why things appear somewhat fine, even without MetaHTML available.
> and, for now, the loss of the mhc stuff here:
> https://dmalcolm.fedorapeople.org/gcc/2025-01-15/htdocs/
>
> compared to:
> https://gcc.gnu.org/
So it appears the two biggest losses are
(1) the default footer on every page, and
(2) the navigation bar on the main page?
Plus
(3) loss of favicon.ico on every page,
(4) postprocessing of /install docs,
(5) no longer adding "- GNU Project" to every page title.
Anything else you are aware of?
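Of the items above, (5) is probably the simplest to reproduce in a Python postprocessing step. A minimal sketch, assuming the suffix is exactly " - GNU Project" and that each page has a single <title> element; the name add_title_suffix is hypothetical and not part of the posted patch:

```python
import re

# Assumed suffix; the thread only says that "- GNU Project" is added to titles.
TITLE_SUFFIX = " - GNU Project"


def add_title_suffix(html):
    """Append TITLE_SUFFIX to the first <title>, unless it is already there."""
    def repl(match):
        title = match.group(2).strip()
        if title.endswith(TITLE_SUFFIX.strip()):
            # Already suffixed: leave the element untouched.
            return match.group(0)
        return match.group(1) + title + TITLE_SUFFIX + match.group(3)

    return re.sub(r"(<title>)(.*?)(</title>)", repl, html,
                  count=1, flags=re.DOTALL | re.IGNORECASE)
```

Checking for an existing suffix keeps the step safe to run on pages that already spell out the full title by hand.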
> Gerald: if you have mhc working, can you please try adjusting the
> bin/ so it runs *both* mhc and the Python script?
I have a 32-bit x86 build on a local machine which probably is 20+ years
old, plus a comparable, though not identical, build on gcc.gnu.org.
Building from scratch is something I tried a while ago and gave up on. Not
infeasible when one patches out code we don't need that does some
non-standard things, but painful and not worth it.
(I'm sorry, I'm not sure what you mean by the above, i.e., what you'd like
to see adjusted?)
> --- a/bin/preprocess
> +++ b/bin/preprocess
> @@ -33,8 +33,6 @@
> #
> # By Gerald Pfeifer <pfeifer@dbai.tuwien.ac.at> 1999-12-29.
^^^^^^^^^^
Well, talking about old code! Back then MetaHTML built fine IIRC on your
average GNU/Linux distribution. :-/
How do we best take it from there?
I believe at this point, with MetaHTML unrecoverably dead for 10+ years
and my website rework done, there isn't dramatically much left that we'd
be missing.
htdocs/style.mhtml actually tells us what, the two bigger items being (1)
default footer and (2) navigation bar on the main page as listed above.
(The BACKPATH code in style.mhtml was used for the GCJ main page, which is
gone now. Of course it would be nice to have navigation on every page, but
that's an enhancement, not a regression to avoid.)
With those two addressed, and (3) possibly later on, I think we should
bite the bullet, rip off the bandaid, plunge into the cold water, whatever
idiom we want to use. :-)
How can we tackle those?
Maybe some "macros", ideally HTML comments, that insert text or include a
text file from a magic subdirectory? We could use that for the navigation
aspect.
And some "include this text before </body> in every single document"
magic for the default footer?
(Disclaimer: these are just some ideas. There may be vastly better ways.)
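As a rough illustration of both ideas, here is a Python sketch. It assumes a hypothetical marker syntax, <!-- include: NAME -->, and a hypothetical snippet directory; neither exists in wwwdocs today, and the function names are made up for this example.

```python
import re
from pathlib import Path

# Hypothetical marker: <!-- include: navigation.html -->
# NAME is restricted to a plain file name (no slashes).
INCLUDE_RE = re.compile(r"<!--\s*include:\s*(\w[\w.-]*)\s*-->")


def expand_includes(html, snippet_dir):
    """Replace each include marker with the named file from snippet_dir."""
    def repl(match):
        return (Path(snippet_dir) / match.group(1)).read_text()

    return INCLUDE_RE.sub(repl, html)


def add_footer(html, footer):
    """Insert footer just before the last </body>; no-op if there is none."""
    head, sep, tail = html.rpartition("</body>")
    if not sep:
        return html
    return head + footer + sep + tail
```

Using HTML comments as markers means an unprocessed page still renders normally in a browser, which fits the goal of keeping individual pages self-contained.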
What do you think?
Gerald
On Thu, 2025-01-16 at 22:58 +0800, Gerald Pfeifer wrote:
> On Wed, 15 Jan 2025, David Malcolm wrote:
> > The heading elements in our website contain "id" information,
> > but currently to find them you have to look at the page source,
> > whereas in the generated HTML for the manual we have e.g.:
> >
> > <a class="copiable-link" href="#index-mabi-1"> ¶</a>
> >
> > which shows up nicely in the browser in e.g.
> > https://gcc.gnu.org/onlinedocs/gcc/ARM-Options.html
> :
> > It's *very* helpful to have easily shareable links to within pages.
>
> Absolutely agreed.
>
> > I've never managed to build MetaHTML and have always just crossed
> > my
> > fingers and hoped when making edits to the GCC website;
> > bin/preprocess
> > just errors out for me immediately due to not finding mhc.
>
> Yes, sadly the GNU project let MetaHTML die (though I raised this
> more
> than once). I still think the concept as such was fine and it served
> us
> well over the years, but building it was already challenging 20 years ago
> and
> would require some fierce source code editing nowadays. :-(
>
> > So this patch as written replaces the invocation of mhc with an
> > invocation of the python script, which of course drops various
> > features.
>
> Yeah! I was hoping we could return to your script. IIRC I once pinged
> and
> you were busy; happy to collaborate on finishing this up.
>
> In any case, a few years ago I spent quite some time and effort to
> prepare
> the stage, migrating the site to CSS (done) and making individual
> pages
> self contained (also done), which removed most of the original
> MetaHTML
> usage.
>
> This is why things appear somewhat fine, even without MetaHTML
> available.
>
> > and, for now, the loss of the mhc stuff here:
> > https://dmalcolm.fedorapeople.org/gcc/2025-01-15/htdocs/
> >
> > compared to:
> > https://gcc.gnu.org/
>
> So it appears the two biggest losses are
> (1) the default footer on every page, and
> (2) the navigation bar on the main page?
> Plus
> (3) loss of favicon.ico on every page,
> (4) postprocessing of /install docs,
> (5) no longer adding "- GNU Project" to every page title.
>
> Anything else you are aware of?
>
> > Gerald: if you have mhc working, can you please try adjusting the
> > bin/ so it runs *both* mhc and the Python script?
>
> I have a 32-bit x86 build on a local machine which probably is 20+
> years
> old, plus a comparable, though not identical, build on gcc.gnu.org.
>
> Building newly is something I tried a while ago and gave up. Not
> infeasible when one patches out code we don't need to some non-
> standard
> things, but painful and not worth it.
>
> (I'm sorry, I'm not sure what you mean by the above, i.e., what you'd
> like
> to see adjusted?)
Sorry for being unclear.
What I mean is that I think it's possible to run *both* mhc and my
script on the input files (my script takes a file, rather than stdin,
so it can't be done directly in a shell pipeline though).
But I don't have a working mhc so I can't test that; you do, so I was
hoping you could hack up preprocess so it runs both.
Alternatively I can try to write a version of the patch that does that
(but I can't test it locally).
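One way to make that possible would be to let the script act as a filter when no file arguments are given. The following is only a sketch: it keeps the substitution from the posted bin/process_html.py, but the function names (add_heading_links, process_stream, main) and the stdin fallback are assumptions, not part of the patch.

```python
import re
import sys


def add_heading_links(line):
    """Same substitution as the posted script, applied to one line."""
    for element_name in ('h1', 'h2', 'h3', 'h4'):
        pattern = (r'<' + element_name
                   + r' id="(.+)">(.+)</' + element_name + '>')
        replacement = (r'<' + element_name
                       + r' id="\1"><a href="#\1">\2</a></'
                       + element_name + '>')
        line = re.sub(pattern, replacement, line)
    return line


def process_stream(f_in, f_out):
    """Copy f_in to f_out, rewriting headings line by line."""
    for line in f_in:
        f_out.write(add_heading_links(line))


def main(argv):
    if len(argv) == 3:
        # Posted behaviour: process_html.py INPUT OUTPUT
        with open(argv[1]) as f_in, open(argv[2], 'w') as f_out:
            process_stream(f_in, f_out)
    else:
        # Assumed extension: act as a stdin-to-stdout filter.
        process_stream(sys.stdin, sys.stdout)
    return 0
```

A real script would end with sys.exit(main(sys.argv)). Since mhc writes its result to stdout in the existing bin/preprocess, its output could then be piped through the script before landing in $TMPDIR/output.raw. That arrangement is untested here because it needs a working mhc, and a plain shell pipeline only reports the exit status of the last command, so an mhc failure would need separate handling.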
I'd love to get rid of metahtml, but for now I just want easily
copyable links for the gcc 15 "changes" and porting guide.
>
> > --- a/bin/preprocess
> > +++ b/bin/preprocess
> > @@ -33,8 +33,6 @@
> > #
> > # By Gerald Pfeifer <pfeifer@dbai.tuwien.ac.at> 1999-12-29.
> ^^^^^^^^^^
> Well, talking about old code! Back then MetaHTML built fine IIRC on
> your
> average GNU/Linux distribution. :-/
>
>
> How do we best take it from there?
>
> I believe at this point, and with MetaHTML unrecoverably dead for 10+
> years, and my website rework there isn't dramatically much left we're
> missing.
>
> htdocs/style.mhtml actually tells us what, the two bigger items being
> (1)
> default footer and (2) navigation bar on the main page as listed
> above.
>
> (The BACKPATH code in style.mhtml was used for the GCJ main page,
> which is
> gone now. Of course it would be nice to have navigation on every
> page, but
> that's an enhancement, not a regression to avoid.)
>
>
> With those two addressed, and (3) possibly later on, I think we
> should
> bite the bullet, rip off the bandaid, plunge into the cold water,
> whatever
> idiom we want to use. :-)
>
>
> How can we tackle those?
>
> Maybe some "macros", best HTML comments that insert text or include a
> text
> file from a magic subdirectory? We could use that for the navigation
> aspect.
>
> And some "include this text before </body> in every single document"
> magic for the default footer?
>
> (Disclaimer: these are just some ideas. There may be vastly better
> ways.)
If we're going that way, can we simply use a well-known Python
templating system, such as Jinja:
https://jinja.palletsprojects.com/en/stable/
or just migrate to a well-known static site generator? I looked for
ones implemented in Python and found Pelican:
https://getpelican.com/
but I'm not a web developer.
Maybe Arsen has some suggestions?
But for now, I'm hoping to get the fix for links in; please let me know
if you want me to try another version of the patch.
Dave
On Thu, 2025-01-16 at 22:58 +0800, Gerald Pfeifer wrote:
> On Wed, 15 Jan 2025, David Malcolm wrote:
> > The heading elements in our website contain "id" information,
> > but currently to find them you have to look at the page source,
> > whereas in the generated HTML for the manual we have e.g.:
> >
> > <a class="copiable-link" href="#index-mabi-1"> ¶</a>
> >
> > which shows up nicely in the browser in e.g.
> > https://gcc.gnu.org/onlinedocs/gcc/ARM-Options.html
> :
> > It's *very* helpful to have easily shareable links to within pages.
>
> Absolutely agreed.
>
> > I've never managed to build MetaHTML and have always just crossed
> > my
> > fingers and hoped when making edits to the GCC website;
> > bin/preprocess
> > just errors out for me immediately due to not finding mhc.
>
> Yes, sadly the GNU project let MetaHTML die (though I raised this
> more
> than once). I still think the concept as such was fine and it served
> us
> well over the years, but building it was already challenging 20 years ago
> and
> would require some fierce source code editing nowadays. :-(
>
> > So this patch as written replaces the invocation of mhc with an
> > invocation of the python script, which of course drops various
> > features.
>
> Yeah! I was hoping we could return to your script. IIRC I once pinged
> and
> you were busy; happy to collaborate on finishing this up.
As it happens, I had entirely forgotten about this earlier work until
you and Joseph mentioned it.
For reference the old patch is here:
https://gcc.gnu.org/legacy-ml/gcc-patches/2018-06/msg00176.html
Maybe I can allocate some cycles in stage 4 to fully eliminating mhc
from the website build.
Dave
@@ -33,8 +33,6 @@
#
# By Gerald Pfeifer <pfeifer@dbai.tuwien.ac.at> 1999-12-29.
-MHC=${MHC-/usr/local/bin/mhc}
-
SOURCETREE=${SOURCETREE-/www/gcc/htdocs-preformatted}
DESTTREE=${DESTTREE-/www/gcc/htdocs}
@@ -114,9 +112,9 @@ process_html_file()
printf '<set-var MHTML::INCLUDE-PREFIX="%s">\n' `pwd` >> $TMPDIR/input
cat $f >> $TMPDIR/input
- if ! ${MHC} $TMPDIR/input > $TMPDIR/output.raw; then
- echo "${MHC} failed; aborting."
- exit 1
+ if ! python3 $SOURCETREE/bin/process_html.py $TMPDIR/input $TMPDIR/output.raw; then
+ echo "bin/process_html.py failed; aborting."
+ exit 1
fi
# Use sed to work around makeinfo 4.7 brokenness.
@@ -227,11 +225,6 @@ shift `expr ${OPTIND} - 1`
## Various safety checks.
-if ! ${MHC} --version >/dev/null; then
- echo "Something does not look right with \"${MHC}\"; aborting."
- exit 1
-fi
-
if [ ! -d $SOURCETREE ]; then
echo "Source tree \"$SOURCETREE\" does not exist."
exit 1
new file mode 100644
@@ -0,0 +1,32 @@
+#! /usr/bin/python3
+#
+# Python 3 script to preprocess .html files below htdocs
+
+import re
+import sys
+
+input_path = sys.argv[1]
+output_path = sys.argv[2]
+
+with open(input_path) as f_in:
+    with open(output_path, 'w') as f_out:
+        for line in f_in:
+            # Convert from e.g.
+            #  <h1 id="ID">TEXT</h1>
+            # to:
+            #  <h1 id="ID"><a href="#ID">TEXT</a></h1>
+            for element_name in {'h1', 'h2', 'h3', 'h4'}:
+                pattern = \
+                    (r'<'
+                     + element_name
+                     + r' id="(.+)">(.+)</'
+                     + element_name
+                     + '>')
+                replacement = \
+                    (r'<'
+                     + element_name
+                     + r' id="\1"><a href="#\1">\2</a></'
+                     + element_name
+                     + '>')
+                line = re.sub(pattern, replacement, line)
+            f_out.write(line)
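
A note on the pattern in the script above: both groups use a greedy (.+), so a line holding two headings, or a heading followed by another quoted attribute, can be captured as a single oversized match. The sketch below is one possible tightening, not a change that was posted in this thread; the names HEADING_RE and link_headings are hypothetical.

```python
import re

# One pattern for h1..h4; \1 in the pattern ties the closing tag to the
# opening one.  [^"]+ stops the id at its closing quote, and .+? keeps the
# heading text from running past the first matching close tag.
HEADING_RE = re.compile(r'<(h[1-4]) id="([^"]+)">(.+?)</\1>')


def link_headings(line):
    """Wrap the text of each <hN id="ID"> heading in <a href="#ID">."""
    return HEADING_RE.sub(r'<\1 id="\2"><a href="#\2">\3</a></\1>', line)
```

Like the posted script, this still works line by line, only matches headings whose sole attribute is id, and is not idempotent: running it twice over the same output would nest a second link inside the first.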