[RFC] newlib/libc/include/langinfo.h: nl_langinfo enum off by one error causing pointer overwrite

Message ID d1a0b9da87e90420bbaf192ba01a141386f07966.1724283807.git.Brian.Inglis@SystematicSW.ab.ca
State New
Headers
Series [RFC] newlib/libc/include/langinfo.h: nl_langinfo enum off by one error causing pointer overwrite |

Commit Message

Brian Inglis Aug. 21, 2024, 11:58 p.m. UTC
  if __HAVE_LOCALE_INFO__ is defined, then _NL_MESSAGES_CODESET is defined
instead of _NL_CTYPE_CODESET after _NL_CTYPE_MB_CUR_MAX, and
_NL_MESSAGES_CODESET is omitted before _NL_MESSAGES_WYESEXPR, resulting
in _NL_MESSAGES_WNOSTR pointer being overwritten by _NL_COLLATE_CODESET
pointer

demonstration of pointer overwrite by nl_langinfo dump program generated
with additional langinfo.h hack after fix:

    $ diff -4 /usr{/src/newlib-cygwin/newlib/libc,}/include/langinfo.h
    --- /usr/src/newlib-cygwin/newlib/libc/include/langinfo.h       2024-08-21 17:11:00.852069900 -0600
    +++ /usr/include/langinfo.h     2024-08-21 16:59:59.609998900 -0600
    @@ -301,9 +301,9 @@ enum
       _NL_MESSAGES_CODESET,
       _NL_MESSAGES_WYESEXPR,
       _NL_MESSAGES_WNOEXPR,
       _NL_MESSAGES_WYESSTR,
    -  _NL_MESSAGES_WNOSTR,
    +#define _NL_MESSAGES_WNOSTR _NL_MESSAGES_WYESSTR

       _NL_COLLATE_CODESET,

MESSAGES_CODESET                 UTF-8
MESSAGES_WYESEXPR                ^[+1yY]
MESSAGES_WNOEXPR                 ^[-0nN]
MESSAGES_WYESSTR                 yes
79 y 65 e 73 s 0 ^@ 6e n 6f o 0 ^@ 0 ^@	    # %hx %C dump
COLLATE_CODESET                  UTF-8

if __HAVE_LOCALE_INFO__ is not defined, then _NL_MESSAGES_CODESET is
also not defined, so it is unclear if _NL_MESSAGES_CODESET should be
defined to _NL_CTYPE_CODESET if neither __HAVE_LOCALE_INFO__ nor
__HAVE_LOCALE_INFO_EXTENDED__ are defined, or added as another field
depending on those definitions

Signed-off-by: Brian Inglis <Brian.Inglis@SystematicSW.ab.ca>
---
 newlib/libc/include/langinfo.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
  

Comments

Corinna Vinschen Aug. 22, 2024, 8:07 a.m. UTC | #1
Hi Brian,

On Aug 21 17:58, Brian Inglis wrote:
> if __HAVE_LOCALE_INFO__ is defined, then _NL_MESSAGES_CODESET is defined
> instead of _NL_CTYPE_CODESET

There is no _NL_CTYPE_CODESET, only CODESET, and the value of
_NL_MESSAGES_CODESET is what it is for backward compatibility.

The values must not change.

> demonstration of pointer overwrite by nl_langinfo dump program generated
> with additional langinfo.h hack after fix:

Can you please provide your STC?

> if __HAVE_LOCALE_INFO__ is not defined, then _NL_MESSAGES_CODESET is
> also not defined, so it is unclear if _NL_MESSAGES_CODESET should be
> defined to _NL_CTYPE_CODESET if neither __HAVE_LOCALE_INFO__ nor
> __HAVE_LOCALE_INFO_EXTENDED__ are defined, or added as another field
> depending on those definitions

As I wrote above, the values must not change.  And given _NL_CTYPE_CODESET
doesn't really exist (but CODESET does), I'm not sure I understand the
problem here.  As on Linux we have a value _NL_CTYPE_CODESET_NAME which
is equivalent to CODESET.


Corinna
  
Brian Inglis Aug. 22, 2024, 5:14 p.m. UTC | #2
On 2024-08-22 02:07, Corinna Vinschen wrote:
> Hi Brian,
> 
> On Aug 21 17:58, Brian Inglis wrote:
>> if __HAVE_LOCALE_INFO__ is defined, then _NL_MESSAGES_CODESET is defined
>> instead of _NL_CTYPE_CODESET
> 
> There is no _NL_CTYPE_CODESET, only CODESET, and the value of
> _NL_MESSAGES_CODESET is what it is for backward compatibility.
> 
> The values must not change.

What about the pointer positions in the array?

>> demonstration of pointer overwrite by nl_langinfo dump program generated
>> with additional langinfo.h hack after fix:
> 
> Can you please provide your STC?

Attached, with log, assuming current langinfo.h.
Also attached is my original program, for use with appropriately fixed and/or 
hacked up langinfo.h, with my nl_items label array adjusted to match.

>> if __HAVE_LOCALE_INFO__ is not defined, then _NL_MESSAGES_CODESET is
>> also not defined, so it is unclear if _NL_MESSAGES_CODESET should be
>> defined to _NL_CTYPE_CODESET if neither __HAVE_LOCALE_INFO__ nor
>> __HAVE_LOCALE_INFO_EXTENDED__ are defined, or added as another field
>> depending on those definitions
> 
> As I wrote above, the values must not change.  And given _NL_CTYPE_CODESET
> doesn't really exist (but CODESET does), I'm not sure I understand the
> problem here.  As on Linux we have a value _NL_CTYPE_CODESET_NAME which
> is equivalent to CODESET.

An instance of a CODESET item and pointer exists, before the wide messages 
fields, and not accounted for in the enum defined in langinfo.h, used to 
retrieve the values by nl_langinfo(3).

What is shown in the enum as the _NL_MESSAGE_CODESET item appears to be 
generated from the LC_CTYPE locale entry, and added to the buffer and pointer 
array after the _NL_CTYPE_MB_CUR_MAX pointer, before the _NL_CTYPE_OUTDIGIT?_MB 
item pointers.

What is not shown in the enum is the CODESET item generated from the LC_MESSAGES 
locale entry, and added to the buffer and pointer array after the 
_NL_MONETARY_WNEGATIVE_SIGN pointer before the _NL_MESSAGES_WYESEXPR pointer.

As a result, the underlying pointers in the pointer array are off by one in the 
wide messages fields, and the _NL_COLLATE_CODESET pointer clobbers what was 
stored in the _NL_MESSAGES_WNOSTR position in the pointer array.

This is shown below from the STC log attached: you can see in the dump that the 
underlying data from the current WNOSTR item, has L"yes" followed by L"no" L"", 
and the following COLLATE_CODESET item looks normal:

MESSAGES_WYESEXPR                呕ⵆ8孞ㄫ她佯]孞〭乮]敹s潮帀嬀⬀㄀礀夀漀伀崀
5455 '呕' 2d46 'ⵆ' 0038 '8' 5b5e '孞' 312b 'ㄫ' 5979 '她' 4f6f '佯' 005d ']'
55 'U' 54 'T' 46 'F' 2d '-' 38 '8' 00 '' 5e '^' 5b '[' 2b '+' 31 '1' 79 'y' 59 
'Y' 6f 'o' 4f 'O' 5d ']' 00 ''
MESSAGES_WNOEXPR                 ^[+1yYoO]
005e '^' 005b '[' 002b '+' 0031 '1' 0079 'y' 0059 'Y' 006f 'o' 004f 'O'
5e '^' 00 '' 5b '[' 00 '' 2b '+' 00 '' 31 '1' 00 '' 79 'y' 00 '' 59 'Y' 00 '' 6f 
'o' 00 '' 4f 'O' 00 ''
MESSAGES_WYESSTR                 ^[-0nN]
005e '^' 005b '[' 002d '-' 0030 '0' 006e 'n' 004e 'N' 005d ']' 0000 ''
5e '^' 00 '' 5b '[' 00 '' 2d '-' 00 '' 30 '0' 00 '' 6e 'n' 00 '' 4e 'N' 00 '' 5d 
']' 00 '' 00 '' 00 ''
MESSAGES_WNOSTR                  yes
0079 'y' 0065 'e' 0073 's' 0000 '' 006e 'n' 006f 'o' 0000 '' 0000 ''
79 'y' 00 '' 65 'e' 00 '' 73 's' 00 '' 00 '' 00 '' 6e 'n' 00 '' 6f 'o' 00 '' 00 
'' 00 '' 00 '' 00 ''
COLLATE_CODESET                  UTF-8
5455 '呕' 2d46 'ⵆ' 0038 '8' 0000 '' 0000 '' 0000 '' 0000 '' 0000 ''
55 'U' 54 'T' 46 'F' 2d '-' 38 '8' 00 '' 00 '' 00 '' 00 '' 00 '' 00 '' 00 '' 00 
'' 00 '' 00 '' 00 ''

So the generated locale data contains a (message) codeset preceding the wide 
message fields, and this is missing from the enum, so that the following item 
pointers are off by one, and the collate codeset pointer and item look correct, 
as the pointer clobbered that for WNOSTR.

Additional langinfo.h hack after fix to show existing problem:

     $ diff -4 /usr{/src/newlib-cygwin/newlib/libc,}/include/langinfo.h
     --- /usr/src/newlib-cygwin/newlib/libc/include/langinfo.h       2024-08-21 
17:11:00.852069900 -0600
     +++ /usr/include/langinfo.h     2024-08-21 16:59:59.609998900 -0600
     @@ -301,9 +301,9 @@ enum
        _NL_MESSAGES_CODESET,
        _NL_MESSAGES_WYESEXPR,
        _NL_MESSAGES_WNOEXPR,
        _NL_MESSAGES_WYESSTR,
     -  _NL_MESSAGES_WNOSTR,
     +#define _NL_MESSAGES_WNOSTR _NL_MESSAGES_WYESSTR

        _NL_COLLATE_CODESET,

Demonstration of pointer overwrite by nl_langinfo dump program run with patch 
and above hack:

MESSAGES_CODESET                 UTF-8
MESSAGES_WYESEXPR                ^[+1yY]
MESSAGES_WNOEXPR                 ^[-0nN]
MESSAGES_WYESSTR                 yes
79 y 65 e 73 s 0 ^@ 6e n 6f o 0 ^@ 0 ^@	    # %hx %C dump
COLLATE_CODESET                  UTF-8

If __HAVE_LOCALE_INFO__ is not defined, then _NL_MESSAGES_CODESET is
also not defined, so it is unclear if _NL_MESSAGES_CODESET should be
defined to _NL_CTYPE_CODESET if neither __HAVE_LOCALE_INFO__ nor
__HAVE_LOCALE_INFO_EXTENDED__ are defined, or added as another field
depending on those definitions?
  
Corinna Vinschen Aug. 22, 2024, 7:13 p.m. UTC | #3
On Aug 22 11:14, Brian Inglis wrote:
> On 2024-08-22 02:07, Corinna Vinschen wrote:
> > Hi Brian,
> > 
> > On Aug 21 17:58, Brian Inglis wrote:
> > > if __HAVE_LOCALE_INFO__ is defined, then _NL_MESSAGES_CODESET is defined
> > > instead of _NL_CTYPE_CODESET
> > 
> > There is no _NL_CTYPE_CODESET, only CODESET, and the value of
> > _NL_MESSAGES_CODESET is what it is for backward compatibility.
> > 
> > The values must not change.
> 
> What about the pointer positions in the array?
> 
> > > demonstration of pointer overwrite by nl_langinfo dump program generated
> > > with additional langinfo.h hack after fix:
> > 
> > Can you please provide your STC?
> 
> Attached, with log, assuming current langinfo.h.
> Also attached is my original program, for use with appropriately fixed
> and/or hacked up langinfo.h, with my nl_items label array adjusted to match.
> 
> > > if __HAVE_LOCALE_INFO__ is not defined, then _NL_MESSAGES_CODESET is
> > > also not defined, so it is unclear if _NL_MESSAGES_CODESET should be
> > > defined to _NL_CTYPE_CODESET if neither __HAVE_LOCALE_INFO__ nor
> > > __HAVE_LOCALE_INFO_EXTENDED__ are defined, or added as another field
> > > depending on those definitions
> > 
> > As I wrote above, the values must not change.  And given _NL_CTYPE_CODESET
> > doesn't really exist (but CODESET does), I'm not sure I understand the
> > problem here.  As on Linux we have a value _NL_CTYPE_CODESET_NAME which
> > is equivalent to CODESET.
> 
> An instance of a CODESET item and pointer exists, before the wide messages
> fields, and not accounted for in the enum defined in langinfo.h, used to
> retrieve the values by nl_langinfo(3).

I don't understand this sentence.  There is a _NL_MESSAGES_CODESET at this
point, and the value is *not* supposed to be a pointer into an array or
struct.  Whatever is wrong, it's not the values defined in langinfo.h

> What is shown in the enum as the _NL_MESSAGE_CODESET item appears to be
> generated from the LC_CTYPE locale entry, and added to the buffer and
> pointer array after the _NL_CTYPE_MB_CUR_MAX pointer, before the
> _NL_CTYPE_OUTDIGIT?_MB item pointers.
> 
> What is not shown in the enum is the CODESET item generated from the
> LC_MESSAGES locale entry, and added to the buffer and pointer array after
> the _NL_MONETARY_WNEGATIVE_SIGN pointer before the _NL_MESSAGES_WYESEXPR
> pointer.
> 

This is a wrong assumption.  Compare with newlib/libc/locale/nl_langinfo.c,
line 183.

> As a result, the underlying pointers in the pointer array are off by one in
> the wide messages fields, and the _NL_COLLATE_CODESET pointer clobbers what
> was stored in the _NL_MESSAGES_WNOSTR position in the pointer array.

Yes, youre' right, there's an off by one.  But this can't be fixed by
chaning nl_item values.

The culprit is actually in nl_langinfo.c.  The internal nl_ext array
erronously contains an entry for lc_monetary_T::codeset which is not
part of the _NL_LOCALE_EXTENDED entries.  This moves the wide char
yes/no strings accidentally by one.  The patch is simple:

diff --git a/newlib/libc/locale/nl_langinfo.c b/newlib/libc/locale/nl_langinfo.c
index c34a7d131376..4477d833bec1 100644
--- a/newlib/libc/locale/nl_langinfo.c
+++ b/newlib/libc/locale/nl_langinfo.c
@@ -160,7 +160,6 @@ static struct _nl_item_t
   _NLITEM (monetary, wmon_thousands_sep),
   _NLITEM (monetary, wpositive_sign),
   _NLITEM (monetary, wnegative_sign),
-  _NLITEM (messages, codeset),
   _NLITEM (messages, wyesexpr),
   _NLITEM (messages, wnoexpr),
   _NLITEM (messages, wyesstr),

> If __HAVE_LOCALE_INFO__ is not defined, then _NL_MESSAGES_CODESET is
> also not defined, so it is unclear if _NL_MESSAGES_CODESET should be
> defined to _NL_CTYPE_CODESET if neither __HAVE_LOCALE_INFO__ nor
> __HAVE_LOCALE_INFO_EXTENDED__ are defined, or added as another field
> depending on those definitions?

No, see above.  There is no definition of _NL_CTYPE_CODESET, only
_NL_CTYPE_CODESET_NAME == CODESET.  _NL_MESSAGES_CODESET is exactly
where it has to be. If you start moving _NL_* values around, you *will*
break backward compatibility with existing executables.

However, there *is* another problem.  _NL_MESSAGES_CODESET is defined if
__HAVE_LOCALE_INFO__ is defined.  OTOH, the internal datatype struct
lc_messages_T (see libc/locale/setlocale.h) defines the codeset member
only if __HAVE_LOCALE_INFO_EXTENDED__ is defined.  This is no problem at
all for Cygwin, which defines __HAVE_LOCALE_INFO_EXTENDED__ anyway, but
a target only defining __HAVE_LOCALE_INFO__ would probably see an
error during compilation because lc_messages_T::codeset will be
undefined.

This is... weird.

Does no other target except Cygwin define __HAVE_LOCALE_INFO__???

Anyway, I'll fix it.


Thanks,
Corinna
  
Brian Inglis Aug. 22, 2024, 9:25 p.m. UTC | #4
On 2024-08-22 13:13, Corinna Vinschen wrote:
> On Aug 22 11:14, Brian Inglis wrote:
>> On 2024-08-22 02:07, Corinna Vinschen wrote:
>>> Hi Brian,
>>>
>>> On Aug 21 17:58, Brian Inglis wrote:
>>>> if __HAVE_LOCALE_INFO__ is defined, then _NL_MESSAGES_CODESET is defined
>>>> instead of _NL_CTYPE_CODESET
>>>
>>> There is no _NL_CTYPE_CODESET, only CODESET, and the value of
>>> _NL_MESSAGES_CODESET is what it is for backward compatibility.
>>>
>>> The values must not change.
>>
>> What about the pointer positions in the array?
>>
>>>> demonstration of pointer overwrite by nl_langinfo dump program generated
>>>> with additional langinfo.h hack after fix:
>>>
>>> Can you please provide your STC?
>>
>> Attached, with log, assuming current langinfo.h.
>> Also attached is my original program, for use with appropriately fixed
>> and/or hacked up langinfo.h, with my nl_items label array adjusted to match.
>>
>>>> if __HAVE_LOCALE_INFO__ is not defined, then _NL_MESSAGES_CODESET is
>>>> also not defined, so it is unclear if _NL_MESSAGES_CODESET should be
>>>> defined to _NL_CTYPE_CODESET if neither __HAVE_LOCALE_INFO__ nor
>>>> __HAVE_LOCALE_INFO_EXTENDED__ are defined, or added as another field
>>>> depending on those definitions
>>>
>>> As I wrote above, the values must not change.  And given _NL_CTYPE_CODESET
>>> doesn't really exist (but CODESET does), I'm not sure I understand the
>>> problem here.  As on Linux we have a value _NL_CTYPE_CODESET_NAME which
>>> is equivalent to CODESET.
>>
>> An instance of a CODESET item and pointer exists, before the wide messages
>> fields, and not accounted for in the enum defined in langinfo.h, used to
>> retrieve the values by nl_langinfo(3).
> 
> I don't understand this sentence.  There is a _NL_MESSAGES_CODESET at this
> point, and the value is *not* supposed to be a pointer into an array or
> struct.  Whatever is wrong, it's not the values defined in langinfo.h
> 
>> What is shown in the enum as the _NL_MESSAGE_CODESET item appears to be
>> generated from the LC_CTYPE locale entry, and added to the buffer and
>> pointer array after the _NL_CTYPE_MB_CUR_MAX pointer, before the
>> _NL_CTYPE_OUTDIGIT?_MB item pointers.
>>
>> What is not shown in the enum is the CODESET item generated from the
>> LC_MESSAGES locale entry, and added to the buffer and pointer array after
>> the _NL_MONETARY_WNEGATIVE_SIGN pointer before the _NL_MESSAGES_WYESEXPR
>> pointer.
>>
> 
> This is a wrong assumption.  Compare with newlib/libc/locale/nl_langinfo.c,
> line 183.
> 
>> As a result, the underlying pointers in the pointer array are off by one in
>> the wide messages fields, and the _NL_COLLATE_CODESET pointer clobbers what
>> was stored in the _NL_MESSAGES_WNOSTR position in the pointer array.
> 
> Yes, youre' right, there's an off by one.  But this can't be fixed by
> chaning nl_item values.
> 
> The culprit is actually in nl_langinfo.c.  The internal nl_ext array
> erronously contains an entry for lc_monetary_T::codeset which is not
> part of the _NL_LOCALE_EXTENDED entries.  This moves the wide char
> yes/no strings accidentally by one.  The patch is simple:
> 
> diff --git a/newlib/libc/locale/nl_langinfo.c b/newlib/libc/locale/nl_langinfo.c
> index c34a7d131376..4477d833bec1 100644
> --- a/newlib/libc/locale/nl_langinfo.c
> +++ b/newlib/libc/locale/nl_langinfo.c
> @@ -160,7 +160,6 @@ static struct _nl_item_t
>     _NLITEM (monetary, wmon_thousands_sep),
>     _NLITEM (monetary, wpositive_sign),
>     _NLITEM (monetary, wnegative_sign),
> -  _NLITEM (messages, codeset),
>     _NLITEM (messages, wyesexpr),
>     _NLITEM (messages, wnoexpr),
>     _NLITEM (messages, wyesstr),
> 
>> If __HAVE_LOCALE_INFO__ is not defined, then _NL_MESSAGES_CODESET is
>> also not defined, so it is unclear if _NL_MESSAGES_CODESET should be
>> defined to _NL_CTYPE_CODESET if neither __HAVE_LOCALE_INFO__ nor
>> __HAVE_LOCALE_INFO_EXTENDED__ are defined, or added as another field
>> depending on those definitions?
> 
> No, see above.  There is no definition of _NL_CTYPE_CODESET, only
> _NL_CTYPE_CODESET_NAME == CODESET.  _NL_MESSAGES_CODESET is exactly
> where it has to be. If you start moving _NL_* values around, you *will*
> break backward compatibility with existing executables.
> 
> However, there *is* another problem.  _NL_MESSAGES_CODESET is defined if
> __HAVE_LOCALE_INFO__ is defined.  OTOH, the internal datatype struct
> lc_messages_T (see libc/locale/setlocale.h) defines the codeset member
> only if __HAVE_LOCALE_INFO_EXTENDED__ is defined.  This is no problem at
> all for Cygwin, which defines __HAVE_LOCALE_INFO_EXTENDED__ anyway, but
> a target only defining __HAVE_LOCALE_INFO__ would probably see an
> error during compilation because lc_messages_T::codeset will be
> undefined.
> 
> This is... weird.
> 
> Does no other target except Cygwin define __HAVE_LOCALE_INFO__???
> 
> Anyway, I'll fix it.

But aren't those entries populated from Windows, where the codeset is copied in
winsup/cygwin/nlsfuncs.cc(__set_lc_messages_from_win)

https://cygwin.com/git?p=newlib-cygwin.git;a=blob;f=winsup/cygwin/nlsfuncs.cc#l1114

into newlib/libc/locale/setlocale.h(lc_messages_t):

https://cygwin.com/git?p=newlib-cygwin.git;a=blob;f=newlib/libc/locale/setlocale.h#l166

which includes codeset?
  
Corinna Vinschen Aug. 23, 2024, 8:39 a.m. UTC | #5
On Aug 22 15:25, Brian Inglis wrote:
> On 2024-08-22 13:13, Corinna Vinschen wrote:
> > On Aug 22 11:14, Brian Inglis wrote:
> > > As a result, the underlying pointers in the pointer array are off by one in
> > > the wide messages fields, and the _NL_COLLATE_CODESET pointer clobbers what
> > > was stored in the _NL_MESSAGES_WNOSTR position in the pointer array.
> > 
> > Yes, youre' right, there's an off by one.  But this can't be fixed by
> > chaning nl_item values.
> > 
> > The culprit is actually in nl_langinfo.c.  The internal nl_ext array
> > erronously contains an entry for lc_monetary_T::codeset which is not
> > part of the _NL_LOCALE_EXTENDED entries.  This moves the wide char
> > yes/no strings accidentally by one.  The patch is simple:
> > 
> > diff --git a/newlib/libc/locale/nl_langinfo.c b/newlib/libc/locale/nl_langinfo.c
> > index c34a7d131376..4477d833bec1 100644
> > --- a/newlib/libc/locale/nl_langinfo.c
> > +++ b/newlib/libc/locale/nl_langinfo.c
> > @@ -160,7 +160,6 @@ static struct _nl_item_t
> >     _NLITEM (monetary, wmon_thousands_sep),
> >     _NLITEM (monetary, wpositive_sign),
> >     _NLITEM (monetary, wnegative_sign),
> > -  _NLITEM (messages, codeset),
> >     _NLITEM (messages, wyesexpr),
> >     _NLITEM (messages, wnoexpr),
> >     _NLITEM (messages, wyesstr),
> > 
> > > If __HAVE_LOCALE_INFO__ is not defined, then _NL_MESSAGES_CODESET is
> > > also not defined, so it is unclear if _NL_MESSAGES_CODESET should be
> > > defined to _NL_CTYPE_CODESET if neither __HAVE_LOCALE_INFO__ nor
> > > __HAVE_LOCALE_INFO_EXTENDED__ are defined, or added as another field
> > > depending on those definitions?
> > 
> > No, see above.  There is no definition of _NL_CTYPE_CODESET, only
> > _NL_CTYPE_CODESET_NAME == CODESET.  _NL_MESSAGES_CODESET is exactly
> > where it has to be. If you start moving _NL_* values around, you *will*
> > break backward compatibility with existing executables.
> > [...]
> 
> But aren't those entries populated from Windows, where the codeset is copied in
> winsup/cygwin/nlsfuncs.cc(__set_lc_messages_from_win)
> 
> https://cygwin.com/git?p=newlib-cygwin.git;a=blob;f=winsup/cygwin/nlsfuncs.cc#l1114
> 
> into newlib/libc/locale/setlocale.h(lc_messages_t):
> 
> https://cygwin.com/git?p=newlib-cygwin.git;a=blob;f=newlib/libc/locale/setlocale.h#l166
> 
> which includes codeset?

No.  First of all, apart from fetching LANG from the windows locale
at startup (see /etc/profile.d/lang.*), the codeset is maintained by
Cygwin/newlib alone, via LC_* environment and setlocale() calls in
the application.

Also, Windows has no GetLocaleInfo call for fetching locale-specific
yes/no strings.  I. e., there's no equivalent for LC_MESSAGES.

Therefore Cygwin fetches the information from static information
maintained in winsup/cygwin/local_includes/lc_msg.h.

If you look closely into __set_lc_messages_from_win(), you'll
notice that the __get_rfc5646_from_locale() function is called
from __set_lc_messages_from_win() *only* to check if a matching
Windows locale exists.  This is a pure sanity check.  After that,
any further information is taken from the lc_msg struct defined
in lc_msg.h, which in turn has been fetched from Linux.  See
winsup/cygwin/linux-locale-helpers.

Does that make sense?


Thanks,
Corinna
  
Brian Inglis Aug. 24, 2024, 4:54 p.m. UTC | #6
On 2024-08-23 02:39, Corinna Vinschen wrote:
> On Aug 22 15:25, Brian Inglis wrote:
>> On 2024-08-22 13:13, Corinna Vinschen wrote:
>>> On Aug 22 11:14, Brian Inglis wrote:
>>>> As a result, the underlying pointers in the pointer array are off by one in
>>>> the wide messages fields, and the _NL_COLLATE_CODESET pointer clobbers what
>>>> was stored in the _NL_MESSAGES_WNOSTR position in the pointer array.
>>>
>>> Yes, youre' right, there's an off by one.  But this can't be fixed by
>>> chaning nl_item values.
>>>
>>> The culprit is actually in nl_langinfo.c.  The internal nl_ext array
>>> erronously contains an entry for lc_monetary_T::codeset which is not
>>> part of the _NL_LOCALE_EXTENDED entries.  This moves the wide char
>>> yes/no strings accidentally by one.  The patch is simple:
>>>
>>> diff --git a/newlib/libc/locale/nl_langinfo.c b/newlib/libc/locale/nl_langinfo.c
>>> index c34a7d131376..4477d833bec1 100644
>>> --- a/newlib/libc/locale/nl_langinfo.c
>>> +++ b/newlib/libc/locale/nl_langinfo.c
>>> @@ -160,7 +160,6 @@ static struct _nl_item_t
>>>      _NLITEM (monetary, wmon_thousands_sep),
>>>      _NLITEM (monetary, wpositive_sign),
>>>      _NLITEM (monetary, wnegative_sign),
>>> -  _NLITEM (messages, codeset),
>>>      _NLITEM (messages, wyesexpr),
>>>      _NLITEM (messages, wnoexpr),
>>>      _NLITEM (messages, wyesstr),
>>>
>>>> If __HAVE_LOCALE_INFO__ is not defined, then _NL_MESSAGES_CODESET is
>>>> also not defined, so it is unclear if _NL_MESSAGES_CODESET should be
>>>> defined to _NL_CTYPE_CODESET if neither __HAVE_LOCALE_INFO__ nor
>>>> __HAVE_LOCALE_INFO_EXTENDED__ are defined, or added as another field
>>>> depending on those definitions?
>>>
>>> No, see above.  There is no definition of _NL_CTYPE_CODESET, only
>>> _NL_CTYPE_CODESET_NAME == CODESET.  _NL_MESSAGES_CODESET is exactly
>>> where it has to be. If you start moving _NL_* values around, you *will*
>>> break backward compatibility with existing executables.
>>> [...]
>>
>> But aren't those entries populated from Windows, where the codeset is copied in
>> winsup/cygwin/nlsfuncs.cc(__set_lc_messages_from_win)
>>
>> https://cygwin.com/git?p=newlib-cygwin.git;a=blob;f=winsup/cygwin/nlsfuncs.cc#l1114
>>
>> into newlib/libc/locale/setlocale.h(lc_messages_t):
>>
>> https://cygwin.com/git?p=newlib-cygwin.git;a=blob;f=newlib/libc/locale/setlocale.h#l166
>>
>> which includes codeset?
> 
> No.  First of all, apart from fetching LANG from the windows locale
> at startup (see /etc/profile.d/lang.*), the codeset is maintained by
> Cygwin/newlib alone, via LC_* environment and setlocale() calls in
> the application.
> 
> Also, Windows has no GetLocaleInfo call for fetching locale-specific
> yes/no strings.  I. e., there's no equivalent for LC_MESSAGES.
> 
> Therefore Cygwin fetches the information from static information
> maintained in winsup/cygwin/local_includes/lc_msg.h.
> 
> If you look closely into __set_lc_messages_from_win(), you'll
> notice that the __get_rfc5646_from_locale() function is called
> from __set_lc_messages_from_win() *only* to check if a matching
> Windows locale exists.  This is a pure sanity check.  After that,
> any further information is taken from the lc_msg struct defined
> in lc_msg.h, which in turn has been fetched from Linux.  See
> winsup/cygwin/linux-locale-helpers.
> 
> Does that make sense?

Yes - I dug into all those sources and that is not any issue.

But __set_lc_messages_from_win() then adds the codeset field before the wide 
character fields, as with the other Cygwin functions, so would it not be more 
consistent and simpler to just add an enum entry for the codeset, called 
something unique including ...CODESET..., keeping the entry and realigning tha 
last few indexes with the items, as nobody else has noticed any issue in the 
last 15 years ;^>
  
Corinna Vinschen Aug. 26, 2024, 8:42 a.m. UTC | #7
On Aug 24 10:54, Brian Inglis wrote:
> On 2024-08-23 02:39, Corinna Vinschen wrote:
> > No.  First of all, apart from fetching LANG from the windows locale
> > at startup (see /etc/profile.d/lang.*), the codeset is maintained by
> > Cygwin/newlib alone, via LC_* environment and setlocale() calls in
> > the application.
> > 
> > Also, Windows has no GetLocaleInfo call for fetching locale-specific
> > yes/no strings.  I. e., there's no equivalent for LC_MESSAGES.
> > 
> > Therefore Cygwin fetches the information from static information
> > maintained in winsup/cygwin/local_includes/lc_msg.h.
> > 
> > If you look closely into __set_lc_messages_from_win(), you'll
> > notice that the __get_rfc5646_from_locale() function is called
> > from __set_lc_messages_from_win() *only* to check if a matching
> > Windows locale exists.  This is a pure sanity check.  After that,
> > any further information is taken from the lc_msg struct defined
> > in lc_msg.h, which in turn has been fetched from Linux.  See
> > winsup/cygwin/linux-locale-helpers.
> > 
> > Does that make sense?
> 
> Yes - I dug into all those sources and that is not any issue.
> 
> But __set_lc_messages_from_win() then adds the codeset field before the wide
> character fields, as with the other Cygwin functions, so would it not be
> more consistent and simpler to just add an enum entry for the codeset,
> called something unique including ...CODESET..., keeping the entry and
> realigning tha last few indexes with the items, as nobody else has noticed
> any issue in the last 15 years ;^>

Again: The values MUST NOT change for backward compat.

The order in the nl_item enum doesn't reflect a struct order anyway,
because the first items are the standarized ones, and only then the
extensions follow with the preference of values already defined with
__HAVE_LOCALE_INFO__.

Apart from that, the actual numeric values are meaningless.  They could
have been chosen by throwing a dice.

If you feel really strongly about, we can add a comment, like this:

diff --git a/newlib/libc/include/langinfo.h b/newlib/libc/include/langinfo.h
index 41d090d3710e..509e8fb0e493 100644
--- a/newlib/libc/include/langinfo.h
+++ b/newlib/libc/include/langinfo.h
@@ -298,6 +298,7 @@ enum
   _NL_MONETARY_WPOSITIVE_SIGN,
   _NL_MONETARY_WNEGATIVE_SIGN,
 
+  /* _NL_MESSAGES_CODESET already defined earlier for compatibility */
   _NL_MESSAGES_WYESEXPR,
   _NL_MESSAGES_WNOEXPR,
   _NL_MESSAGES_WYESSTR,


Corinna
  

Patch

diff --git a/newlib/libc/include/langinfo.h b/newlib/libc/include/langinfo.h
index 41d090d3710e..e383dc59d83f 100644
--- a/newlib/libc/include/langinfo.h
+++ b/newlib/libc/include/langinfo.h
@@ -179,7 +179,7 @@  enum
 
 #ifdef __HAVE_LOCALE_INFO__
   _NL_CTYPE_MB_CUR_MAX,
-  _NL_MESSAGES_CODESET,
+  _NL_CTYPE_CODESET,
 
 #ifdef __HAVE_LOCALE_INFO_EXTENDED__
 
@@ -298,6 +298,7 @@  enum
   _NL_MONETARY_WPOSITIVE_SIGN,
   _NL_MONETARY_WNEGATIVE_SIGN,
 
+  _NL_MESSAGES_CODESET,
   _NL_MESSAGES_WYESEXPR,
   _NL_MESSAGES_WNOEXPR,
   _NL_MESSAGES_WYESSTR,