intl: Treat C.UTF-8 locale like C locale, part 2 (BZ# 16621)

Message ID 20230910191739.1083016-1-bruno@clisp.org
State Committed
Commit d0aefec49941cf6d97e2244d6aa20bafc26d5942
Series intl: Treat C.UTF-8 locale like C locale, part 2 (BZ# 16621)

Checks

Context                                           Check    Description
redhat-pt-bot/TryBot-apply_patch                  success  Patch applied to master at the time it was sent
redhat-pt-bot/TryBot-32bit                        success  Build for i686
linaro-tcwg-bot/tcwg_glibc_build--master-arm      success  Testing passed
linaro-tcwg-bot/tcwg_glibc_check--master-arm      success  Testing passed
linaro-tcwg-bot/tcwg_glibc_build--master-aarch64  success  Testing passed
linaro-tcwg-bot/tcwg_glibc_check--master-aarch64  success  Testing passed

Commit Message

Bruno Haible Sept. 10, 2023, 7:17 p.m. UTC
  The previous commit was incomplete: gettext() still returns a translation
if the file /usr/share/locale/C/LC_MESSAGES/<domain>.mo exists. This patch
prohibits the translation also in this case.

* gettext-runtime/intl/dcigettext.c (DCIGETTEXT): Treat C.<encoding> locale
like the C locale.
---
 intl/dcigettext.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)
  

Comments

Florian Weimer Sept. 11, 2023, 7:43 a.m. UTC | #1
* Bruno Haible:

> The previous commit was incomplete: gettext() still returns a translation
> if the file /usr/share/locale/C/LC_MESSAGES/<domain>.mo exists. This patch
> prohibits the translation also in this case.
>
> * gettext-runtime/intl/dcigettext.c (DCIGETTEXT): Treat C.<encoding> locale
> like the C locale.
> ---
>  intl/dcigettext.c | 7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/intl/dcigettext.c b/intl/dcigettext.c
> index 27063886d2..fb69bbf94b 100644
> --- a/intl/dcigettext.c
> +++ b/intl/dcigettext.c
> @@ -691,9 +691,10 @@ DCIGETTEXT (const char *domainname, const char *msgid1, const char *msgid2,
>  	    continue;
>  	}
>  
> -      /* If the current locale value is C (or POSIX) we don't load a
> -	 domain.  Return the MSGID.  */
> -      if (strcmp (single_locale, "C") == 0
> +      /* If the current locale value is "C" or "C.<encoding>" or "POSIX",
> +	 we don't load a domain.  Return the MSGID.  */
> +      if ((single_locale[0] == 'C'
> +	   && (single_locale[1] == '\0' || single_locale[1] == '.'))
>  	  || strcmp (single_locale, "POSIX") == 0)
>  	break;

I wasn't sure if this is a bug.  The implementation does not fall back to
translation, it just uses C as a message catalog.  Do you consider this
a problem?

Thanks,
Florian
  
Bruno Haible Sept. 12, 2023, 2:44 p.m. UTC | #2
Florian Weimer wrote:
> > The previous commit was incomplete: gettext() still returns a translation
> > if the file /usr/share/locale/C/LC_MESSAGES/<domain>.mo exists. This patch
> > prohibits the translation also in this case.
> 
> I wasn't sure if this is a bug.  The implementation does not fall back to
> translation, it just uses C as a message catalog.  Do you consider this
> a problem?

Yes, I consider this a bug, for two reasons:

* The wiki page https://sourceware.org/glibc/wiki/Proposals/C.UTF-8 states
  "It shall be the C locale but with UTF-8 encodings."
  and
  "These will be the same as C... LC_MESSAGES"

  The C locale has the property that gettext() returns the msgid in all cases,
  regardless of what files are on disk and regardless of the values of any
  environment variables.

  If the C.UTF-8 locale has the property that gettext() returns msgid only if
  there is no translation catalog at
  /usr/share/locale/C/LC_MESSAGES/<domain>.mo,
  it is *not* the same as "the C locale but with UTF-8 encodings".

* We have this rule, that gettext() returns the msgid when the locale is the
  "C" locale, because
     - the POSIX standard specifies the precise output of some programs (e.g.
       'diff') in the C locale, and
     - we wanted, from the beginning in 1995, that gettext() can be used in
       the source code of these programs, without an explicit check for the
       locale.
  It is possible that, in the long run, POSIX adopts the C.UTF-8 locale,
  since several platforms already have it: glibc, musl libc, FreeBSD, NetBSD,
  OpenBSD, Cygwin, Android.
  When this happens, we want that the maintainers of 'diff' etc. can continue
  to use gettext(), without introducing an explicit check for the locale.

Bruno
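
A minimal sketch of the property described above, namely that gettext()
returns the msgid while the process is in the plain "C" locale.  It assumes a
platform whose <libintl.h> provides gettext() (as glibc does); the helper name
c_locale_returns_msgid and the text domain "demo" are hypothetical and chosen
only for illustration.

```c
#include <libintl.h>
#include <locale.h>
#include <string.h>

/* Hypothetical helper: returns 1 if gettext() hands back MSGID unchanged
   while the process runs in the plain "C" locale, 0 otherwise.  */
static int
c_locale_returns_msgid (const char *msgid)
{
  /* Select the "C" locale for all categories, including LC_MESSAGES.  */
  if (setlocale (LC_ALL, "C") == NULL)
    return 0;

  /* "demo" is an assumed text domain name, not one from the patch.  */
  textdomain ("demo");

  /* In the "C" locale no catalog is consulted, so the lookup is expected
     to yield the original string.  */
  return strcmp (gettext (msgid), msgid) == 0;
}
```

The same check with setlocale (LC_ALL, "C.UTF-8") is the case this patch is
about; its result depends on the libc version and on whether that locale is
installed, so it is not asserted here.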
  
Florian Weimer Dec. 12, 2023, 9:08 a.m. UTC | #3
* Bruno Haible:

> The previous commit was incomplete: gettext() still returns a translation
> if the file /usr/share/locale/C/LC_MESSAGES/<domain>.mo exists. This patch
> prohibits the translation also in this case.
>
> * gettext-runtime/intl/dcigettext.c (DCIGETTEXT): Treat C.<encoding> locale
> like the C locale.
> ---
>  intl/dcigettext.c | 7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/intl/dcigettext.c b/intl/dcigettext.c
> index 27063886d2..fb69bbf94b 100644
> --- a/intl/dcigettext.c
> +++ b/intl/dcigettext.c
> @@ -691,9 +691,10 @@ DCIGETTEXT (const char *domainname, const char *msgid1, const char *msgid2,
>  	    continue;
>  	}
>  
> -      /* If the current locale value is C (or POSIX) we don't load a
> -	 domain.  Return the MSGID.  */
> -      if (strcmp (single_locale, "C") == 0
> +      /* If the current locale value is "C" or "C.<encoding>" or "POSIX",
> +	 we don't load a domain.  Return the MSGID.  */
> +      if ((single_locale[0] == 'C'
> +	   && (single_locale[1] == '\0' || single_locale[1] == '.'))
>  	  || strcmp (single_locale, "POSIX") == 0)
>  	break;

These arguments:

> * The wiki page https://sourceware.org/glibc/wiki/Proposals/C.UTF-8 states
>   "It shall be the C locale but with UTF-8 encodings."
>   and
>   "These will be the same as C... LC_MESSAGES"
> 
>   The C locale has the property that gettext() returns the msgid in all cases,
>   regardless of what files are on disk and regardless of the values of any
>   environment variables.
> 
>   If the C.UTF-8 locale has the property that gettext() returns msgid only if
>   there is no translation catalog at
>   /usr/share/locale/C/LC_MESSAGES/<domain>.mo,
>   it is *not* the same as "the C locale but with UTF-8 encodings".
> 
> * We have this rule, that gettext() returns the msgid when the locale is the
>   "C" locale, because
>      - the POSIX standard specifies the precise output of some programs (e.g.
>        'diff') in the C locale, and
>      - we wanted, from the beginning in 1995, that gettext() can be used in
>        the source code of these programs, without an explicit check for the
>        locale.
>   It is possible that, in the long run, POSIX adopts the C.UTF-8 locale,
>   since several platforms already have it: glibc, musl libc, FreeBSD, NetBSD,
>   OpenBSD, Cygwin, Android.
>   When this happens, we want that the maintainers of 'diff' etc. can continue
>   to use gettext(), without introducing an explicit check for the locale.

<https://inbox.sourceware.org/libc-alpha/10272749.85pcf5A44T@nimes/>

convinced me, so:

Reviewed-by: Florian Weimer <fweimer@redhat.com>

I'll push it for you.

Thanks,
Florian
  

Patch

diff --git a/intl/dcigettext.c b/intl/dcigettext.c
index 27063886d2..fb69bbf94b 100644
--- a/intl/dcigettext.c
+++ b/intl/dcigettext.c
@@ -691,9 +691,10 @@  DCIGETTEXT (const char *domainname, const char *msgid1, const char *msgid2,
 	    continue;
 	}
 
-      /* If the current locale value is C (or POSIX) we don't load a
-	 domain.  Return the MSGID.  */
-      if (strcmp (single_locale, "C") == 0
+      /* If the current locale value is "C" or "C.<encoding>" or "POSIX",
+	 we don't load a domain.  Return the MSGID.  */
+      if ((single_locale[0] == 'C'
+	   && (single_locale[1] == '\0' || single_locale[1] == '.'))
 	  || strcmp (single_locale, "POSIX") == 0)
 	break;
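
The new condition can be read in isolation as a small predicate.  The sketch
below copies the test from the hunk above into a standalone function so its
behavior on a few locale names is easy to check; the function name
is_untranslated_locale is hypothetical and does not appear in dcigettext.c.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical standalone copy of the patched condition: true when
   SINGLE_LOCALE is "C", "C.<encoding>" or "POSIX", i.e. when DCIGETTEXT
   would skip loading a domain and return the MSGID.  */
static bool
is_untranslated_locale (const char *single_locale)
{
  return (single_locale[0] == 'C'
          /* Accept exactly "C", or "C" followed by ".<encoding>".  */
          && (single_locale[1] == '\0' || single_locale[1] == '.'))
         || strcmp (single_locale, "POSIX") == 0;
}
```

Read this way, "C", "C.UTF-8" and "POSIX" match, while names that merely
start with 'C' or that add an encoding to "POSIX" (for example "POSIX.UTF-8")
do not, because only the exact string "POSIX" is compared.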