[v2] binutils/windmc: Parse input correctly on big endian hosts

Message ID 20240124122523.384659-2-rjones@redhat.com
State New
Headers
Series [v2] binutils/windmc: Parse input correctly on big endian hosts |

Checks

Context Check Description
linaro-tcwg-bot/tcwg_binutils_build--master-aarch64 success Testing passed
linaro-tcwg-bot/tcwg_binutils_check--master-aarch64 success Testing passed
linaro-tcwg-bot/tcwg_binutils_build--master-arm success Testing passed
linaro-tcwg-bot/tcwg_binutils_check--master-arm success Testing passed

Commit Message

Richard W.M. Jones Jan. 24, 2024, 12:25 p.m. UTC
  On big endian hosts (eg. s390x) the windmc tool fails to parse even
trivial files:

  $ cat test.mc
  ;
  $ ./binutils/windmc ./test.mc
  In test.mc at line 1: parser: syntax error.
  In test.mc at line 1: fatal: syntax error.

The tool starts by reading the input as Windows CP1252 and then
converting it internally into an array of UTF-16LE, which it then
processes as an array of unsigned short (typedef unichar).

There are lots of ways this is wrong, but in the specific case of big
endian machines the little endian pairs of bytes are byte-swapped.

For example, the ';' character in the input above is first converted
to UTF16-LE byte sequence { 0x3b, 0x00 }, which is then cast to
unsigned short.  On a big endian machine the first unichar appears to
be 0x3b00.  The lexer is unable to recognize this as the comment
character ((unichar)';') and so parsing fails.

The simple fix is to convert the input to UTF-16BE on big endian
machines (and do the reverse conversion when writing the output).

Fixes: https://sourceware.org/bugzilla/show_bug.cgi?id=31283
Signed-off-by: Richard W.M. Jones <rjones@redhat.com>
---
 binutils/configure.ac |  2 ++
 binutils/winduni.c    | 16 ++++++++++++++--
 2 files changed, 16 insertions(+), 2 deletions(-)
  

Comments

Nick Clifton Jan. 25, 2024, 3:18 p.m. UTC | #1
Hi Richard,

> The simple fix is to convert the input to UTF-16BE on big endian
> machines (and do the reverse conversion when writing the output).

Approved - please apply (mainline and if you wish, the 2.42 branch as well).

Cheers
   Nick
  
Richard W.M. Jones Jan. 25, 2024, 3:52 p.m. UTC | #2
On Thu, Jan 25, 2024 at 03:18:23PM +0000, Nick Clifton wrote:
> Hi Richard,
> 
> >The simple fix is to convert the input to UTF-16BE on big endian
> >machines (and do the reverse conversion when writing the output).
> 
> Approved - please apply (mainline and if you wish, the 2.42 branch as well).

Thanks - I guess I'm not allowed to push the patch though?

Rich.
  
Nick Clifton Jan. 26, 2024, 10:08 a.m. UTC | #3
Hi Rich,

>>> The simple fix is to convert the input to UTF-16BE on big endian
>>> machines (and do the reverse conversion when writing the output).
>>
>> Approved - please apply (mainline and if you wish, the 2.42 branch as well).
> 
> Thanks - I guess I'm not allowed to push the patch though?

No you are allowed to push it.

Sorry for the confusion.  By "apply" I actually meant "commit and push".

Cheers
   Nick
  
Richard W.M. Jones Jan. 26, 2024, 10:12 a.m. UTC | #4
On Fri, Jan 26, 2024 at 10:08:32AM +0000, Nick Clifton wrote:
> Hi Rich,
> 
> >>>The simple fix is to convert the input to UTF-16BE on big endian
> >>>machines (and do the reverse conversion when writing the output).
> >>
> >>Approved - please apply (mainline and if you wish, the 2.42 branch as well).
> >
> >Thanks - I guess I'm not allowed to push the patch though?
> 
> No you are allowed to push it.
> 
> Sorry for the confusion.  By "apply" I actually meant "commit and push".

Thanks.  Note that someone will need to rerun autoconf for this change
to actually have an effect.  I didn't include that in the patch as the
change is rather large and all in generated code.

Rich.
  
Nick Clifton Jan. 26, 2024, 11:51 a.m. UTC | #5
Hi Rich.

> Thanks.  Note that someone will need to rerun autoconf for this change
> to actually have an effect.  I didn't include that in the patch as the
> change is rather large and all in generated code.

Thanks for doing that - having uncluttered patched to review is much appreciated.

Rerunning autoconf got the 2.42 branch will not be a problem - I will be
doing that as part of the release creation process which should be happening
on Monday.  Rerunning it on the mainline should also not be a big issue,
as long as I notice your patch going in.  If I forget, please could you ping
me ?

Cheers
  Nick
  
Alan Modra Feb. 8, 2024, 9:50 a.m. UTC | #6
On Fri, Jan 26, 2024 at 11:51:07AM +0000, Nick Clifton wrote:
> > Thanks.  Note that someone will need to rerun autoconf for this change
> > to actually have an effect.  I didn't include that in the patch as the
> > change is rather large and all in generated code.
> 
> Thanks for doing that - having uncluttered patched to review is much appreciated.
> 
> Rerunning autoconf on the 2.42 branch will not be a problem - I will be
> doing that as part of the release creation process which should be happening
> on Monday.  Rerunning it on the mainline should also not be a big issue,
> as long as I notice your patch going in.  If I forget, please could you ping
> me ?

I pushed the patch for Richard.
  

Patch

diff --git a/binutils/configure.ac b/binutils/configure.ac
index b03e36c9e0e..dac72c1bdd4 100644
--- a/binutils/configure.ac
+++ b/binutils/configure.ac
@@ -31,6 +31,8 @@  AC_PROG_CC
 AC_GNU_SOURCE
 AC_USE_SYSTEM_EXTENSIONS
 
+AC_C_BIGENDIAN
+
 LT_INIT
 ACX_LARGEFILE
 
diff --git a/binutils/winduni.c b/binutils/winduni.c
index 5b659764948..f19de4f8cb3 100644
--- a/binutils/winduni.c
+++ b/binutils/winduni.c
@@ -771,7 +771,13 @@  wind_MultiByteToWideChar (rc_uint_type cp, const char *mb,
 
   if (!mb || !iconv_name)
     return 0;
-  iconv_t cd = iconv_open ("UTF-16LE", iconv_name);
+  iconv_t cd = iconv_open (
+#if WORDS_BIGENDIAN
+			   "UTF-16BE",
+#else
+			   "UTF-16LE",
+#endif
+			   iconv_name);
 
   while (1)
     {
@@ -844,7 +850,13 @@  wind_WideCharToMultiByte (rc_uint_type cp, const unichar *u, char *mb, rc_uint_t
 
   if (!u || !iconv_name)
     return 0;
-  iconv_t cd = iconv_open (iconv_name, "UTF-16LE");
+  iconv_t cd = iconv_open (iconv_name,
+#if WORDS_BIGENDIAN
+			   "UTF-16BE"
+#else
+			   "UTF-16LE"
+#endif
+			   );
 
   while (1)
     {