localedata: add new locales scn_IT and scn_US

Message ID: 20240427233734.0cb6edff@betelgeuse.hanskalabs.net
State: Changes Requested
Delegated to: Arjun Shankar
Series: localedata: add new locales scn_IT and scn_US

Checks

Context                                           Check    Description
redhat-pt-bot/TryBot-apply_patch                  success  Patch applied to master at the time it was sent
redhat-pt-bot/TryBot-32bit                        success  Build for i686
linaro-tcwg-bot/tcwg_glibc_build--master-aarch64  success  Testing passed
linaro-tcwg-bot/tcwg_glibc_check--master-aarch64  success  Testing passed
linaro-tcwg-bot/tcwg_glibc_build--master-arm      success  Testing passed
linaro-tcwg-bot/tcwg_glibc_check--master-arm      success  Testing passed

Commit Message

David Paleino April 27, 2024, 9:37 p.m. UTC
  Hello,
please consider merging the following patch, adding two new locales, scn_IT and
scn_US.

This is part of the ongoing effort to have the Sicilian language (ISO 639 code
scn) officially recognized as a minority language in Italy. The _US locale is
proposed because the US is currently home to the majority of second- and
third-generation descendants of Sicilians. There are also vast communities in
South America and Australia, but those can wait for another day.

Thank you for considering,
David Paleino
President of Cademia Siciliana

---
 localedata/SUPPORTED      |   2 +
 localedata/locales/scn_IT | 138 ++++++++++++++++++++++++++++++++++++++
 localedata/locales/scn_US | 106 +++++++++++++++++++++++++++++
 3 files changed, 246 insertions(+)
 create mode 100644 localedata/locales/scn_IT
 create mode 100644 localedata/locales/scn_US
  

Comments

Florian Weimer April 29, 2024, 1:17 p.m. UTC | #1
* David Paleino:

> +translit_start
> +<U1E0C><U1E0C> "<U0044><U0044><U0048>"
> +<U1E0D><U1E0D> "<U0064><U0064><U0068>"
> +<U1E0C><U1E0D> "<U0044><U0064><U0068>"
> +translit_end

Please use UTF-8 for new locale definitions.

Is adding scn_US really necessary?  A similar argument could be made
about most languages.

Thanks,
Florian
  
David Paleino May 1, 2024, 12:09 a.m. UTC | #2
Hello,

On Mon, 29 Apr 2024 15:17:56 +0200, Florian Weimer wrote:

> * David Paleino:
> 
> > +translit_start
> > +<U1E0C><U1E0C> "<U0044><U0044><U0048>"
> > +<U1E0D><U1E0D> "<U0064><U0064><U0068>"
> > +<U1E0C><U1E0D> "<U0044><U0064><U0068>"
> > +translit_end  
> 
> Please use UTF-8 for new locale definitions.

Please find attached the revised patch.
 
> Is adding scn_US really necessary?  A similar argument could be made
> about most languages.

Currently, the United States is probably the country hosting the biggest community
of Sicilian expats and their descendants, who might find it useful to have a
separate locale. We, as an association, have had actual demand for the locale
to be implemented. Second place probably goes to Latin America, but I'm only
proposing scn_US here because we already have a keyboard layout for that
particular combination.

I definitely understand that going down the rabbit hole of adding
<minority_language>_* can quickly become a nightmare, so please, if you prefer
scn_US to be dropped from the patch, just find a second patch attached, only
adding scn_IT.

Thank you,
David
  
Florian Weimer May 13, 2024, 1:23 p.m. UTC | #3
David, I completed the UTF-8 conversion.  Would you please double-check
that it's correct and resubmit as appropriate?

Thanks,
Florian
comment_char %
escape_char /

% This file is part of the GNU C Library and contains locale data.
% The Free Software Foundation does not claim any copyright interest
% in the locale data contained in this file.  The foregoing does not
% affect the license of the GNU C Library as a whole.  It does not
% exempt you from the conditions of the license if your use would
% otherwise be governed by that license.

% Sicilian Language Locale for Italy
% Source: Cademia Siciliana
% Address: Via Convento S.F. di Paola, 73
%    91100 Trapani, Italy
% Contact: David Paleino
% Email: david@cademiasiciliana.org
% Tel:
% Fax:
% Language: scn
% Territory: IT
% Revision: 1.0
% Date: 2024-04-27
% Users: general

LC_IDENTIFICATION
title      "Sicilian locale for Italy"
source     "Cademia Siciliana"
address    "Via Convento S.F. di Paola, 73, 91100 Trapani, Italy"
contact    ""
email      "tech@cademiasiciliana.org"
tel        ""
fax        ""
language   "Sicilian"
territory  "Italy"
revision   "1.0"
date       "2024-04-27"

category "i18n:2012";LC_IDENTIFICATION
category "i18n:2012";LC_CTYPE
category "i18n:2012";LC_COLLATE
category "i18n:2012";LC_TIME
category "i18n:2012";LC_NUMERIC
category "i18n:2012";LC_MONETARY
category "i18n:2012";LC_MESSAGES
category "i18n:2012";LC_PAPER
category "i18n:2012";LC_NAME
category "i18n:2012";LC_ADDRESS
category "i18n:2012";LC_TELEPHONE
category "i18n:2012";LC_MEASUREMENT
END LC_IDENTIFICATION

LC_COLLATE
copy "iso14651_t1"
END LC_COLLATE

LC_CTYPE
copy "it_IT"

translit_start
ḌḌ "DDH"
ḍḍ "ddh"
Ḍḍ "Ddh"
translit_end
END LC_CTYPE

LC_MESSAGES
yesexpr "^[+1sSyY]"
noexpr  "^[-0nN]"
yesstr  "se"
nostr   "no"
END LC_MESSAGES

LC_MONETARY
copy "it_IT"
END LC_MONETARY

LC_NUMERIC
copy "it_IT"
END LC_NUMERIC

LC_TIME
copy "it_IT"

abday   "dum";"lun";/
	"mar";"mer";/
	"jov";"ven";/
	"sab"
day     "dumìnica";/
	"lunnidìa";/
	"martidìa";/
	"mercuridìa";/
	"jovidìa";/
	"vènniri";/
	"sàbbatu"
abmon   "jin";"fri";/
	"mar";"apr";/
	"maj";"giu";/
	"gnt";"agu";/
	"sit";"utt";/
	"nuv";"dic"
mon     "jinnaru";/
	"frivaru";/
	"marzu";/
	"aprili";/
	"maju";/
	"giugnu";/
	"giugnettu";/
	"agustu";/
	"sittèmmiru";/
	"uttùviru";/
	"novèmmiru";/
	"dicèmmiru"
END LC_TIME

LC_PAPER
copy "it_IT"
END LC_PAPER

LC_TELEPHONE
copy "it_IT"
END LC_TELEPHONE

LC_MEASUREMENT
copy "it_IT"
END LC_MEASUREMENT

LC_NAME
copy "it_IT"
END LC_NAME

LC_ADDRESS
copy "it_IT"

lang_name    "sicilianu"
lang_ab      ""
lang_term    "scn"
lang_lib    "scn"
END LC_ADDRESS
  
David Paleino May 14, 2024, 8:58 p.m. UTC | #4
On Mon, 13 May 2024 15:23:59 +0200, Florian Weimer wrote:

> David, I completed the UTF-8 conversion.  Would you please double-check
> that it's correct and resubmit as appropriate?

Sorry, somehow I completely missed all the other conversions. Meh.

Final patch attached, thank you!
David
  
David Paleino May 14, 2024, 9:08 p.m. UTC | #5
On Tue, 14 May 2024 22:58:15 +0200, David Paleino wrote:

> [..]
> Final patch attached, thank you!

Meh, I see on the online archives it gets attached a binary blob(?!)

Putting it in simple text format, sorry for the noise.
David

From f6ac8098264dcc4d1666b80bcb96eeda7b7084cd Mon Sep 17 00:00:00 2001
From: David Paleino <dapal@debian.org>
Date: Sat, 27 Apr 2024 23:22:01 +0200
Subject: [PATCH] localedata: add new locale scn_IT

Signed-off-by: David Paleino <dapal@debian.org>
---
 localedata/SUPPORTED      |   1 +
 localedata/locales/scn_IT | 138 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 139 insertions(+)
 create mode 100644 localedata/locales/scn_IT

diff --git a/localedata/SUPPORTED b/localedata/SUPPORTED
index 759895cc3a..96ff43f8fd 100644
--- a/localedata/SUPPORTED
+++ b/localedata/SUPPORTED
@@ -394,6 +394,7 @@ sa_IN/UTF-8 \
 sah_RU/UTF-8 \
 sat_IN/UTF-8 \
 sc_IT/UTF-8 \
+scn_IT/UTF-8 \
 sd_IN/UTF-8 \
 sd_IN@devanagari/UTF-8 \
 se_NO/UTF-8 \
diff --git a/localedata/locales/scn_IT b/localedata/locales/scn_IT
new file mode 100644
index 0000000000..abf9b1e49f
--- /dev/null
+++ b/localedata/locales/scn_IT
@@ -0,0 +1,138 @@
+comment_char %
+escape_char /
+
+% This file is part of the GNU C Library and contains locale data.
+% The Free Software Foundation does not claim any copyright interest
+% in the locale data contained in this file.  The foregoing does not
+% affect the license of the GNU C Library as a whole.  It does not
+% exempt you from the conditions of the license if your use would
+% otherwise be governed by that license.
+
+% Sicilian Language Locale for Italy
+% Source: Cademia Siciliana
+% Address: Via Convento S.F. di Paola, 73
+%    91100 Trapani, Italy
+% Contact: David Paleino
+% Email: david@cademiasiciliana.org
+% Tel:
+% Fax:
+% Language: scn
+% Territory: IT
+% Revision: 1.0
+% Date: 2024-04-27
+% Users: general
+
+LC_IDENTIFICATION
+title      "Sicilian locale for Italy"
+source     "Cademia Siciliana"
+address    "Via Convento S.F. di Paola, 73, 91100 Trapani, Italy"
+contact    ""
+email      "tech@cademiasiciliana.org"
+tel        ""
+fax        ""
+language   "Sicilian"
+territory  "Italy"
+revision   "1.0"
+date       "2024-04-27"
+
+category "i18n:2012";LC_IDENTIFICATION
+category "i18n:2012";LC_CTYPE
+category "i18n:2012";LC_COLLATE
+category "i18n:2012";LC_TIME
+category "i18n:2012";LC_NUMERIC
+category "i18n:2012";LC_MONETARY
+category "i18n:2012";LC_MESSAGES
+category "i18n:2012";LC_PAPER
+category "i18n:2012";LC_NAME
+category "i18n:2012";LC_ADDRESS
+category "i18n:2012";LC_TELEPHONE
+category "i18n:2012";LC_MEASUREMENT
+END LC_IDENTIFICATION
+
+LC_COLLATE
+copy "iso14651_t1"
+END LC_COLLATE
+
+LC_CTYPE
+copy "it_IT"
+
+translit_start
+ḌḌ "DDH"
+ḍḍ "ddh"
+Ḍḍ "Ddh"
+translit_end
+END LC_CTYPE
+
+LC_MESSAGES
+yesexpr "^[+1sSyY]"
+noexpr  "^[-0nN]"
+yesstr  "se"
+nostr   "no"
+END LC_MESSAGES
+
+LC_MONETARY
+copy "it_IT"
+END LC_MONETARY
+
+LC_NUMERIC
+copy "it_IT"
+END LC_NUMERIC
+
+LC_TIME
+copy "it_IT"
+
+abday   "dum";"lun";/
+	"mar";"mer";/
+	"jov";"ven";/
+	"sab"
+day     "dumìnica";/
+	"lunnidìa";/
+	"martidìa";/
+	"mercuridìa";/
+	"jovidìa";/
+	"venniridìa";/
+	"sàbbatu"
+abmon   "jin";"fri";/
+	"mar";"apr";/
+	"maj";"giu";/
+	"gnt";"agu";/
+	"sit";"utt";/
+	"nuv";"dic"
+mon     "jinnaru";/
+	"frivaru";/
+	"marzu";/
+	"aprili";/
+	"maju";/
+	"giugnu";/
+	"giugnettu";/
+	"agustu";/
+	"sittèmmiru";/
+	"uttùviru";/
+	"nuvèmmiru";/
+	"dicèmmiru"
+END LC_TIME
+
+LC_PAPER
+copy "it_IT"
+END LC_PAPER
+
+LC_TELEPHONE
+copy "it_IT"
+END LC_TELEPHONE
+
+LC_MEASUREMENT
+copy "it_IT"
+END LC_MEASUREMENT
+
+LC_NAME
+copy "it_IT"
+END LC_NAME
+
+LC_ADDRESS
+copy "it_IT"
+
+lang_name    "sicilianu"
+lang_ab      ""
+lang_term    "scn"
+lang_lib    "scn"
+END LC_ADDRESS
  
Florian Weimer May 15, 2024, 4:37 a.m. UTC | #6
* David Paleino:

> On Tue, 14 May 2024 22:58:15 +0200, David Paleino wrote:
>
>> [..]
>> Final patch attached, thank you!
>
> Meh, I see on the online archives it gets attached a binary blob(?!)

It shows up in the alternative archives:

  <https://inbox.sourceware.org/libc-alpha/20240514230847.20b64f52@betelgeuse.hanskalabs.net/T/#mfccd4453d0e6706770cfa26e04b8fa7ec2b2995a>

Thanks,
Florian
  
Mike FABIAN May 15, 2024, 3:17 p.m. UTC | #7
David Paleino <dapal@debian.org> wrote:

> +LC_TIME
> +copy "it_IT"   <- problem here
> +
> +abday   "dum";"lun";/
> +	"mar";"mer";/
> +	"jov";"ven";/
> +	"sab"
> +day     "dumìnica";/
> +	"lunnidìa";/
> +	"martidìa";/
> +	"mercuridìa";/
> +	"jovidìa";/
> +	"venniridìa";/
> +	"sàbbatu"
> +abmon   "jin";"fri";/
> +	"mar";"apr";/
> +	"maj";"giu";/
> +	"gnt";"agu";/
> +	"sit";"utt";/
> +	"nuv";"dic"
> +mon     "jinnaru";/
> +	"frivaru";/
> +	"marzu";/
> +	"aprili";/
> +	"maju";/
> +	"giugnu";/
> +	"giugnettu";/
> +	"agustu";/
> +	"sittèmmiru";/
> +	"uttùviru";/
> +	"nuvèmmiru";/
> +	"dicèmmiru"
> +END LC_TIME

> +LC_ADDRESS
> +copy "it_IT"    <- problem here
> +
> +lang_name    "sicilianu"
> +lang_ab      ""
> +lang_term    "scn"
> +lang_lib    "scn"
> +END LC_ADDRESS

$ localedef -f UTF-8 -i scn_IT /tmp/sci_IT.UTF-8
scn_IT:83: no other keyword shall be specified when `copy' is used
scn_IT:133: no other keyword shall be specified when `copy' is used
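
The rule behind these two errors can be checked before running localedef. The following is a minimal sketch, not part of glibc: the function name `find_copy_conflicts` and the sample text are hypothetical, and the decision to skip `translit_start` ... `translit_end` blocks is an assumption based only on the output above, where LC_CTYPE (which has `copy` plus a translit block) was not flagged.

```python
import re

# A category starts with a line that holds only its name, e.g. "LC_TIME".
CATEGORY_RE = re.compile(r"^(LC_[A-Z]+)\s*$")


def find_copy_conflicts(source, comment_char="%", escape_char="/"):
    """Return (category, line_number, keyword) for the first keyword that
    follows a `copy` statement inside the same category."""
    conflicts = []
    category = None
    has_copy = reported = in_translit = continued = False
    for lineno, raw in enumerate(source.splitlines(), start=1):
        line = raw.strip()
        was_continued = continued
        continued = line.endswith(escape_char)  # value continues on next line
        if not line or line.startswith(comment_char):
            continue
        if category is None:
            match = CATEGORY_RE.match(line)
            if match:
                category = match.group(1)
                has_copy = reported = in_translit = False
            continue
        if line == f"END {category}":
            category = None
            continue
        if was_continued:
            continue  # continuation of the previous keyword's value
        if in_translit:
            # Assumption: translit blocks are tolerated next to `copy`.
            in_translit = line != "translit_end"
            continue
        keyword = line.split()[0]
        if keyword == "translit_start":
            in_translit = True
        elif keyword == "copy":
            has_copy = True
        elif has_copy and not reported:
            conflicts.append((category, lineno, keyword))
            reported = True
    return conflicts


# Hypothetical, shortened locale source used only for illustration.
SAMPLE = """\
comment_char %
escape_char /

LC_CTYPE
copy "it_IT"

translit_start
"X" "Y"
translit_end
END LC_CTYPE

LC_TIME
copy "it_IT"

abday   "dum";"lun";/
        "mar"
day     "a"
END LC_TIME

LC_PAPER
copy "it_IT"
END LC_PAPER
"""

for category, lineno, keyword in find_copy_conflicts(SAMPLE):
    print(f"line {lineno}: {keyword} follows copy in {category}")
```

The sketch reports only the first offending keyword per category, since fixing it means either dropping `copy` or dropping every other keyword in that category.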
  
David Paleino May 15, 2024, 6:39 p.m. UTC | #8
On Wed, 15 May 2024 17:17:19 +0200, Mike FABIAN wrote:

> [..]
> 
> $ localedef -f UTF-8 -i scn_IT /tmp/sci_IT.UTF-8
> scn_IT:83: no other keyword shall be specified when `copy' is used
> scn_IT:133: no other keyword shall be specified when `copy' is used

Fixed, thank you.
David

From d5d51e4a162fe3e0057a03f0412e910ab15c0522 Mon Sep 17 00:00:00 2001
From: David Paleino <dapal@debian.org>
Date: Sat, 27 Apr 2024 23:22:01 +0200
Subject: [PATCH] localedata: add new locale scn_IT

Signed-off-by: David Paleino <dapal@debian.org>
---
 localedata/SUPPORTED      |   1 +
 localedata/locales/scn_IT | 150 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 151 insertions(+)
 create mode 100644 localedata/locales/scn_IT

diff --git a/localedata/SUPPORTED b/localedata/SUPPORTED
index 759895cc3a..96ff43f8fd 100644
--- a/localedata/SUPPORTED
+++ b/localedata/SUPPORTED
@@ -394,6 +394,7 @@ sa_IN/UTF-8 \
 sah_RU/UTF-8 \
 sat_IN/UTF-8 \
 sc_IT/UTF-8 \
+scn_IT/UTF-8 \
 sd_IN/UTF-8 \
 sd_IN@devanagari/UTF-8 \
 se_NO/UTF-8 \
diff --git a/localedata/locales/scn_IT b/localedata/locales/scn_IT
new file mode 100644
index 0000000000..6161c529fb
--- /dev/null
+++ b/localedata/locales/scn_IT
@@ -0,0 +1,150 @@
+comment_char %
+escape_char /
+
+% This file is part of the GNU C Library and contains locale data.
+% The Free Software Foundation does not claim any copyright interest
+% in the locale data contained in this file.  The foregoing does not
+% affect the license of the GNU C Library as a whole.  It does not
+% exempt you from the conditions of the license if your use would
+% otherwise be governed by that license.
+
+% Sicilian Language Locale for Italy
+% Source: Cademia Siciliana
+% Address: Via Convento S.F. di Paola, 73
+%    91100 Trapani, Italy
+% Contact: David Paleino
+% Email: david@cademiasiciliana.org
+% Tel:
+% Fax:
+% Language: scn
+% Territory: IT
+% Revision: 1.0
+% Date: 2024-04-27
+% Users: general
+
+LC_IDENTIFICATION
+title      "Sicilian locale for Italy"
+source     "Cademia Siciliana"
+address    "Via Convento S.F. di Paola, 73, 91100 Trapani, Italy"
+contact    ""
+email      "tech@cademiasiciliana.org"
+tel        ""
+fax        ""
+language   "Sicilian"
+territory  "Italy"
+revision   "1.0"
+date       "2024-04-27"
+
+category "i18n:2012";LC_IDENTIFICATION
+category "i18n:2012";LC_CTYPE
+category "i18n:2012";LC_COLLATE
+category "i18n:2012";LC_TIME
+category "i18n:2012";LC_NUMERIC
+category "i18n:2012";LC_MONETARY
+category "i18n:2012";LC_MESSAGES
+category "i18n:2012";LC_PAPER
+category "i18n:2012";LC_NAME
+category "i18n:2012";LC_ADDRESS
+category "i18n:2012";LC_TELEPHONE
+category "i18n:2012";LC_MEASUREMENT
+END LC_IDENTIFICATION
+
+LC_COLLATE
+copy "iso14651_t1"
+END LC_COLLATE
+
+LC_CTYPE
+copy "it_IT"
+
+translit_start
+ḌḌ "DDH"
+ḍḍ "ddh"
+Ḍḍ "Ddh"
+translit_end
+END LC_CTYPE
+
+LC_MESSAGES
+yesexpr "^[+1sSyY]"
+noexpr  "^[-0nN]"
+yesstr  "se"
+nostr   "no"
+END LC_MESSAGES
+
+LC_MONETARY
+copy "it_IT"
+END LC_MONETARY
+
+LC_NUMERIC
+copy "it_IT"
+END LC_NUMERIC
+
+LC_TIME
+abday   "dum";"lun";/
+	"mar";"mer";/
+	"jov";"ven";/
+	"sab"
+day     "dumìnica";/
+	"lunnidìa";/
+	"martidìa";/
+	"mercuridìa";/
+	"jovidìa";/
+	"venniridìa";/
+	"sàbbatu"
+abmon   "jin";"fri";/
+	"mar";"apr";/
+	"maj";"giu";/
+	"gnt";"agu";/
+	"sit";"utt";/
+	"nuv";"dic"
+mon     "jinnaru";/
+	"frivaru";/
+	"marzu";/
+	"aprili";/
+	"maju";/
+	"giugnu";/
+	"giugnettu";/
+	"agustu";/
+	"sittèmmiru";/
+	"uttùviru";/
+	"nuvèmmiru";/
+	"dicèmmiru"
+d_t_fmt "%a %-d %b %Y, %T"
+d_fmt   "%d//%m//%Y"
+t_fmt   "%T"
+am_pm   "";""
+t_fmt_ampm ""
+date_fmt   "%a %-d %b %Y, %T, %Z"
+week 7;19971130;4
+first_weekday 2
+END LC_TIME
+
+LC_PAPER
+copy "it_IT"
+END LC_PAPER
+
+LC_TELEPHONE
+copy "it_IT"
+END LC_TELEPHONE
+
+LC_MEASUREMENT
+copy "it_IT"
+END LC_MEASUREMENT
+
+LC_NAME
+copy "it_IT"
+END LC_NAME
+
+LC_ADDRESS
+postal_fmt    "%f%N%a%N%d%N%b%N%s %h %e %r%N%z %T%N%c%N"
+country_name "Italia"
+country_ab2 "IT"
+country_ab3 "ITA"
+country_num 380
+country_isbn "978-88,979-12"
+country_car  "I"
+
+lang_name    "sicilianu"
+lang_ab      ""
+lang_term    "scn"
+lang_lib    "scn"
+END LC_ADDRESS
  
Mike FABIAN May 16, 2024, 7:38 a.m. UTC | #9
David Paleino <dapal@debian.org> wrote:

> +LC_CTYPE
> +copy "it_IT"
> +
> +translit_start
> +ḌḌ "DDH"
> +ḍḍ "ddh"
> +Ḍḍ "Ddh"
> +translit_end
> +END LC_CTYPE

I am sorry for not testing that earlier, but that translit part does not
seem to work:

bash-5.2# export LC_ALL=scn_IT.UTF-8 
bash-5.2# echo 'ḌḌ' | iconv -f UTF-8 -t ASCII//translit 
??
bash-5.2# echo 'ß' | iconv -f UTF-8 -t ASCII//translit 
ss
bash-5.2#
  
Mike FABIAN May 16, 2024, 8:08 a.m. UTC | #10
Mike FABIAN <mfabian@redhat.com> wrote:

> David Paleino <dapal@debian.org> wrote:
>
>> +LC_CTYPE
>> +copy "it_IT"
>> +
>> +translit_start
>> +ḌḌ "DDH"
>> +ḍḍ "ddh"
>> +Ḍḍ "Ddh"
>> +translit_end
>> +END LC_CTYPE
>
> I am sorry for not testing that earlier, but that translit part does not
> seem to work:
>
> bash-5.2# export LC_ALL=scn_IT.UTF-8 
> bash-5.2# echo 'ḌḌ' | iconv -f UTF-8 -t ASCII//translit 
> ??
> bash-5.2# echo 'ß' | iconv -f UTF-8 -t ASCII//translit 
> ss
> bash-5.2# 


With single input characters the transliteration works, i.e. something
like this works:

LC_CTYPE
copy "it_IT"

translit_start
Ḍ "D"
ḍ "d"
translit_end
END LC_CTYPE

bash-5.2# export LC_ALL=scn_IT.UTF-8 
bash-5.2# echo 'Ḍ' | iconv -f UTF-8 -t ASCII//translit 
D
bash-5.2# 

I think glibc can only transliterate single input characters into an
output string, it most likely cannot transliterate a multi-character
input string into something at the moment.
  
Andreas Schwab May 16, 2024, 8:44 a.m. UTC | #11
On May 16 2024, Mike FABIAN wrote:

> I think glibc can only transliterate single input characters into an
> output string, it most likely cannot transliterate a multi-character
> input string into something at the moment.

AFAICT, it should work with multi-character transliterations.
  
Florian Weimer May 16, 2024, 9:47 a.m. UTC | #12
* Andreas Schwab:

> On May 16 2024, Mike FABIAN wrote:
>
>> I think glibc can only transliterate single input characters into an
>> output string, it most likely cannot transliterate a multi-character
>> input string into something at the moment.
>
> AFAICT, it should work with multi-character transliterations.

How does this work reliably if there is an inconvenient iconv buffer
boundary?

(Not saying this is the case here.)

Thanks,
Florian
  
Andreas Schwab May 16, 2024, 10:20 a.m. UTC | #13
On May 16 2024, Mike FABIAN wrote:

> David Paleino <dapal@debian.org> wrote:
>
>> +LC_CTYPE
>> +copy "it_IT"
>> +
>> +translit_start
>> +ḌḌ "DDH"
>> +ḍḍ "ddh"
>> +Ḍḍ "Ddh"
>> +translit_end
>> +END LC_CTYPE
>
> I am sorry for not testing that earlier, but that translit part does not
> seem to work:
>
> bash-5.2# export LC_ALL=scn_IT.UTF-8 
> bash-5.2# echo 'ḌḌ' | iconv -f UTF-8 -t ASCII//translit 
> ??

There is already a transliteration for Ḍ which takes precedence.
  
Andreas Schwab May 16, 2024, 10:22 a.m. UTC | #14
On May 16 2024, Florian Weimer wrote:

> * Andreas Schwab:
>
>> On May 16 2024, Mike FABIAN wrote:
>>
>>> I think glibc can only transliterate single input characters into an
>>> output string, it most likely cannot transliterate a multi-character
>>> input string into something at the moment.
>>
>> AFAICT, it should work with multi-character transliterations.
>
> How does this work reliably if there is an inconvenient iconv buffer
> boundary?

__gconv_transliterate returns __GCONV_INCOMPLETE_INPUT.
  
Andreas Schwab May 16, 2024, 11 a.m. UTC | #15
On May 16 2024, Andreas Schwab wrote:

> On May 16 2024, Mike FABIAN wrote:
>
>> David Paleino <dapal@debian.org> wrote:
>>
>>> +LC_CTYPE
>>> +copy "it_IT"
>>> +
>>> +translit_start
>>> +ḌḌ "DDH"
>>> +ḍḍ "ddh"
>>> +Ḍḍ "Ddh"
>>> +translit_end
>>> +END LC_CTYPE
>>
>> I am sorry for not testing that earlier, but that translit part does not
>> seem to work:
>>
>> bash-5.2# export LC_ALL=scn_IT.UTF-8 
>> bash-5.2# echo 'ḌḌ' | iconv -f UTF-8 -t ASCII//translit 
>> ??
>
> There is already a transliteration for Ḍ which takes precedence.

Actually, the entries above replace the ones from translit_combining,
but they are interpreted as "Ḍ" -> "DDH" and "ḍ" -> "ḍddh" (the third
entry is ignored).  The proper syntax would be

translit_start
"ḌḌ" "DDH"
"ḍḍ" "ddh"
"Ḍḍ" "Ddh"
translit_end

but depending on how the binary search proceeds, either these or the
shorter matches will win.
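
To illustrate why the winner can depend on the table contents, here is a toy model of a binary search that accepts the first entry whose source is a prefix of the input. It is a sketch under stated assumptions, not glibc's actual lookup code: the `lookup` function and both tables are hypothetical, and the only point is that adding an unrelated entry moves the midpoint and changes which match is found.

```python
D_DOT = "\u1e0c"  # LATIN CAPITAL LETTER D WITH DOT BELOW


def lookup(table, text):
    """Binary search over a table sorted by source string.

    The first probed entry whose source is a prefix of `text` wins,
    so the result depends on where the midpoints happen to fall.
    """
    low, high = 0, len(table)
    while low < high:
        mid = (low + high) // 2
        source, target = table[mid]
        if text.startswith(source):
            return source, target
        if source < text:
            low = mid + 1
        else:
            high = mid
    return None


# Both tables hold the short and the long rule; the second one also
# holds an unrelated entry that sorts before them.
small = sorted([(D_DOT, "D"), (D_DOT * 2, "DDH")])
large = sorted([("A", "a"), (D_DOT, "D"), (D_DOT * 2, "DDH")])

print(lookup(small, D_DOT * 2)[1])  # first probe hits the two-character rule
print(lookup(large, D_DOT * 2)[1])  # first probe hits the one-character rule
```

In this model the same input yields "DDH" with the two-entry table and "D" with the three-entry table, which matches the observation that the outcome is a property of the translit data rather than of the input alone.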
  
Mike FABIAN May 16, 2024, 3:03 p.m. UTC | #16
Andreas Schwab <schwab@suse.de> wrote:

> On May 16 2024, Andreas Schwab wrote:
>
>> On May 16 2024, Mike FABIAN wrote:
>>
>>> David Paleino <dapal@debian.org> wrote:
>>>
>>>> +LC_CTYPE
>>>> +copy "it_IT"
>>>> +
>>>> +translit_start
>>>> +ḌḌ "DDH"
>>>> +ḍḍ "ddh"
>>>> +Ḍḍ "Ddh"
>>>> +translit_end
>>>> +END LC_CTYPE
>>>
>>> I am sorry for not testing that earlier, but that translit part does not
>>> seem to work:
>>>
>>> bash-5.2# export LC_ALL=scn_IT.UTF-8 
>>> bash-5.2# echo 'ḌḌ' | iconv -f UTF-8 -t ASCII//translit 
>>> ??
>>
>> There is already a transliteration for Ḍ which takes precedence.
>
> Actually, the entries above replace the ones from translit_combining,
> but they are interpreted as "Ḍ" -> "DDH" and "ḍ" -> "ḍddh" (the third
> entry is ignored).  The proper syntax would be
>
> translit_start
> "ḌḌ" "DDH"
> "ḍḍ" "ddh"
> "Ḍḍ" "Ddh"
> translit_end

Thank you! I have tried with that syntax now but could not make it work.

> but depending on how the binary search proceeds, either these or the
> shorter matches will win.

Does it depend on the exact input how the binary search goes on?
I tried several inputs, and for me the shorter matches always won.

Then I commented out the shorter matches in translit_combining like this:

diff --git a/localedata/locales/translit_combining b/localedata/locales/translit_combining
index ce2f19eee1..6f879d9caf 100644
--- a/localedata/locales/translit_combining
+++ b/localedata/locales/translit_combining
@@ -2486,9 +2486,9 @@ translit_start
 % LATIN SMALL LETTER D WITH DOT ABOVE
 <U1E0B> <U0064>
 % LATIN CAPITAL LETTER D WITH DOT BELOW
-<U1E0C> <U0044>
+%<U1E0C> <U0044>
 % LATIN SMALL LETTER D WITH DOT BELOW
-<U1E0D> <U0064>
+%<U1E0D> <U0064>
 % LATIN CAPITAL LETTER D WITH LINE BELOW
 <U1E0E> <U0044>
 % LATIN SMALL LETTER D WITH LINE BELOW

and after doing that,

bash-5.2# echo 'ḌḌ'|iconv -f UTF-8 -t ASCII//translit
^C
bash-5.2#

uses 100% CPU and never stops until I stop it with Control-C.
  
Andreas Schwab May 16, 2024, 3:20 p.m. UTC | #17
On May 16 2024, Mike FABIAN wrote:

> Does it depend on the exact input how the binary search goes on?

It depends on the translit data, how the midway point moves through it.

> bash-5.2# echo 'ḌḌ'|iconv -f UTF-8 -t ASCII//translit
> ^C
> bash-5.2#
>
> uses 100% CPU and never stops until I stop it with Control-C.

      else if (cnt > 0)
	/* This means that the input buffer contents matches a prefix of
	   an entry.  Since we cannot match it unless we get more input,
	   we will tell the caller about it.  */
	return __GCONV_INCOMPLETE_INPUT;

This should only return when the end of the input string is reached,
otherwise it's a non-match and it should go on to try other translit
patterns.
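
The distinction described here can be sketched as a three-way classification of one table entry against the available input. This is an illustrative model only, not the glibc implementation: the function `classify` and its string results are hypothetical stand-ins, with "incomplete" playing the role of `__GCONV_INCOMPLETE_INPUT`.

```python
D_DOT = "\u1e0c"  # LATIN CAPITAL LETTER D WITH DOT BELOW


def classify(source, available):
    """Compare one translit source string with the available input.

    "match":      the whole source was found at the start of the input.
    "incomplete": the input ended while it still agreed with the source,
                  so more input is needed to decide.
    "no-match":   a character differed, or nothing matched; the caller
                  should go on to try other entries.
    """
    matched = 0
    while (matched < len(source) and matched < len(available)
           and source[matched] == available[matched]):
        matched += 1
    if matched == len(source):
        return "match"
    if matched > 0 and matched == len(available):
        return "incomplete"
    return "no-match"


print(classify(D_DOT * 2, D_DOT * 2 + "\n"))  # full two-character source present
print(classify(D_DOT * 2, D_DOT))             # input ends after one character
print(classify(D_DOT * 2, D_DOT + "x"))       # prefix agrees, then a mismatch
```

Under this model a partial agreement followed by a differing character is an ordinary non-match, so only a genuinely truncated input asks the caller for more data.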
  
Mike FABIAN June 9, 2024, 7:20 a.m. UTC | #18
Andreas Schwab <schwab@suse.de> wrote:

> On May 16 2024, Mike FABIAN wrote:
>
>> Does it depend on the exact input how the binary search goes on?
>
> It depends on the translit data, how the midway point moves through it.
>
>> bash-5.2# echo 'ḌḌ'|iconv -f UTF-8 -t ASCII//translit
>> ^C
>> bash-5.2#
>>
>> uses 100% CPU and never stops until I stop it with Control-C.
>
>       else if (cnt > 0)
> 	/* This means that the input buffer contents matches a prefix of
> 	   an entry.  Since we cannot match it unless we get more input,
> 	   we will tell the caller about it.  */
> 	return __GCONV_INCOMPLETE_INPUT;
>
> This should only return when the end of the input string is reached,
> otherwise it's a non-match and it should go on to try other translit
> patterns.

I pushed the new scn_IT locale to master but with the translit part
commented out and reported a bug for the translit problem:

https://sourceware.org/bugzilla/show_bug.cgi?id=31859
  

Patch

diff --git a/localedata/SUPPORTED b/localedata/SUPPORTED
index a2f3132480..b87aeda7e1 100644
--- a/localedata/SUPPORTED
+++ b/localedata/SUPPORTED
@@ -393,6 +393,8 @@  sa_IN/UTF-8 \
 sah_RU/UTF-8 \
 sat_IN/UTF-8 \
 sc_IT/UTF-8 \
+scn_IT/UTF-8 \
+scn_US/UTF-8 \
 sd_IN/UTF-8 \
 sd_IN@devanagari/UTF-8 \
 se_NO/UTF-8 \
diff --git a/localedata/locales/scn_IT b/localedata/locales/scn_IT
new file mode 100644
index 0000000000..5c4ee44917
--- /dev/null
+++ b/localedata/locales/scn_IT
@@ -0,0 +1,138 @@ 
+comment_char %
+escape_char /
+
+% This file is part of the GNU C Library and contains locale data.
+% The Free Software Foundation does not claim any copyright interest
+% in the locale data contained in this file.  The foregoing does not
+% affect the license of the GNU C Library as a whole.  It does not
+% exempt you from the conditions of the license if your use would
+% otherwise be governed by that license.
+
+% Sicilian Language Locale for Italy
+% Source: Cademia Siciliana
+% Address: Via Convento S.F. di Paola, 73
+%    91100 Trapani, Italy
+% Contact: David Paleino
+% Email: david@cademiasiciliana.org
+% Tel:
+% Fax:
+% Language: scn
+% Territory: IT
+% Revision: 1.0
+% Date: 2024-04-27
+% Users: general
+
+LC_IDENTIFICATION
+title      "Sicilian locale for Italy"
+source     "Cademia Siciliana"
+address    "Via Convento S.F. di Paola, 73, 91100 Trapani, Italy"
+contact    ""
+email      "tech@cademiasiciliana.org"
+tel        ""
+fax        ""
+language   "Sicilian"
+territory  "Italy"
+revision   "1.0"
+date       "2024-04-27"
+
+category "i18n:2012";LC_IDENTIFICATION
+category "i18n:2012";LC_CTYPE
+category "i18n:2012";LC_COLLATE
+category "i18n:2012";LC_TIME
+category "i18n:2012";LC_NUMERIC
+category "i18n:2012";LC_MONETARY
+category "i18n:2012";LC_MESSAGES
+category "i18n:2012";LC_PAPER
+category "i18n:2012";LC_NAME
+category "i18n:2012";LC_ADDRESS
+category "i18n:2012";LC_TELEPHONE
+category "i18n:2012";LC_MEASUREMENT
+END LC_IDENTIFICATION
+
+LC_COLLATE
+copy "iso14651_t1"
+END LC_COLLATE
+
+LC_CTYPE
+copy "it_IT"
+
+translit_start
+<U1E0C><U1E0C> "<U0044><U0044><U0048>"
+<U1E0D><U1E0D> "<U0064><U0064><U0068>"
+<U1E0C><U1E0D> "<U0044><U0064><U0068>"
+translit_end
+END LC_CTYPE
+
+LC_MESSAGES
+yesexpr "^[+1sSyY]"
+noexpr  "^[-0nN]"
+yesstr  "se"
+nostr   "no"
+END LC_MESSAGES
+
+LC_MONETARY
+copy "it_IT"
+END LC_MONETARY
+
+LC_NUMERIC
+copy "it_IT"
+END LC_NUMERIC
+
+LC_TIME
+copy "it_IT"
+
+abday   "dum";"lun";/
+	"mar";"mer";/
+	"jov";"ven";/
+	"sab"
+day     "dum<U00EC>nica";/
+	"lunnid<U00EC>a";/
+	"martid<U00EC>a";/
+	"mercurid<U00EC>a";/
+	"jovid<U00EC>a";/
+	"v<U00E8>nniri";/
+	"s<U00E0>bbatu"
+abmon   "jin";"fri";/
+	"mar";"apr";/
+	"maj";"giu";/
+	"gnt";"agu";/
+	"sit";"utt";/
+	"nuv";"dic"
+mon     "jinnaru";/
+	"frivaru";/
+	"marzu";/
+	"aprili";/
+	"maju";/
+	"giugnu";/
+	"giugnettu";/
+	"agustu";/
+	"sitt<U00E8>mmiru";/
+	"utt<U00F9>viru";/
+	"nov<U00E8>mmiru";/
+	"dic<U00E8>mmiru"
+END LC_TIME
+
+LC_PAPER
+copy "it_IT"
+END LC_PAPER
+
+LC_TELEPHONE
+copy "it_IT"
+END LC_TELEPHONE
+
+LC_MEASUREMENT
+copy "it_IT"
+END LC_MEASUREMENT
+
+LC_NAME
+copy "it_IT"
+END LC_NAME
+
+LC_ADDRESS
+copy "it_IT"
+
+lang_name    "sicilianu"
+lang_ab      ""
+lang_term    "scn"
+lang_lib    "scn"
+END LC_ADDRESS
diff --git a/localedata/locales/scn_US b/localedata/locales/scn_US
new file mode 100644
index 0000000000..834174d1dd
--- /dev/null
+++ b/localedata/locales/scn_US
@@ -0,0 +1,106 @@ 
+comment_char %
+escape_char /
+
+% This file is part of the GNU C Library and contains locale data.
+% The Free Software Foundation does not claim any copyright interest
+% in the locale data contained in this file.  The foregoing does not
+% affect the license of the GNU C Library as a whole.  It does not
+% exempt you from the conditions of the license if your use would
+% otherwise be governed by that license.
+
+% Sicilian Language Locale for the USA
+% Source: Cademia Siciliana
+% Address: Via Convento S.F. di Paola, 73
+%    91100 Trapani, Italy
+% Contact: David Paleino
+% Email: david@cademiasiciliana.org
+% Tel:
+% Fax:
+% Language: scn
+% Territory: USA
+% Revision: 1.0
+% Date: 2024-04-27
+% Users: general
+
+LC_IDENTIFICATION
+title      "Sicilian locale for the United States"
+source     "Cademia Siciliana"
+address    "Via Convento S.F. di Paola, 73, 91100 Trapani, Italy"
+contact    ""
+email      "tech@cademiasiciliana.org"
+tel        ""
+fax        ""
+language   "Sicilian"
+territory  "United States"
+revision   "1.0"
+date       "2024-04-27"
+
+category "i18n:2012";LC_IDENTIFICATION
+category "i18n:2012";LC_CTYPE
+category "i18n:2012";LC_COLLATE
+category "i18n:2012";LC_TIME
+category "i18n:2012";LC_NUMERIC
+category "i18n:2012";LC_MONETARY
+category "i18n:2012";LC_MESSAGES
+category "i18n:2012";LC_PAPER
+category "i18n:2012";LC_NAME
+category "i18n:2012";LC_ADDRESS
+category "i18n:2012";LC_TELEPHONE
+category "i18n:2012";LC_MEASUREMENT
+END LC_IDENTIFICATION
+
+LC_COLLATE
+copy "iso14651_t1"
+END LC_COLLATE
+
+LC_CTYPE
+copy "scn_IT"
+END LC_CTYPE
+
+LC_MESSAGES
+copy "scn_IT"
+END LC_MESSAGES
+
+LC_MONETARY
+copy "en_US"
+END LC_MONETARY
+
+LC_NUMERIC
+copy "en_US"
+END LC_NUMERIC
+
+LC_TIME
+copy "scn_IT"
+
+week 7;19971130;1
+d_t_fmt "%a %d %b %Y %r %Z"
+d_fmt   "%m//%d//%Y"
+t_fmt   "%r"
+t_fmt_ampm "%I:%M:%S %p"
+date_fmt "%a %b %e %r %Z %Y"
+am_pm   "AM";"PM"
+END LC_TIME
+
+LC_PAPER
+copy "en_US"
+END LC_PAPER
+
+LC_TELEPHONE
+copy "en_US"
+END LC_TELEPHONE
+
+LC_MEASUREMENT
+copy "en_US"
+END LC_MEASUREMENT
+
+LC_NAME
+copy "en_US"
+END LC_NAME
+
+LC_ADDRESS
+copy "en_US"
+lang_name    "sicilianu"
+lang_ab      ""
+lang_term    "scn"
+lang_lib    "scn"
+END LC_ADDRESS