Using qualified language names instead of language code #1631

Open
Omikhleia opened this issue Nov 26, 2022 · 5 comments
Labels
enhancement Software improvement or feature request
Milestone

Comments

@Omikhleia
Member

Omikhleia commented Nov 26, 2022

Most of the code assumes (as the manual also states) that document.language is an ISO 639 language code (e.g. fr, en...). There are a number of cases where this is not sufficient for actual typography.

  • Hyphenation. E.g. Serbian is "sr", which currently has Cyrillic hyphenation patterns, but in written form the language is "digraphic", using the Cyrillic ("sr-Cyrl") or Latin ("sr-Latn") script. Likewise for Azeri, and a bunch of others.
  • Internationalization (fluent and friends)
    • For the same reason as above, obviously
    • But also because regional varieties of a base language may have different conventions, e.g. "es" and "pt" versus "es-AR" and "pt-BR", respectively.
  • Number formatting: see "Number formatting in foreign languages" (#1630) and, slightly related, "Folio numbering needs to follow language" (#1248), for cases where either Latin digits or a different native script are used.
  • Smart typography. E.g. see smartquotes.sile (a dependency of my markdown package): "fr-FR" and "fr-CH" would need to be distinguished, and so would "en-US" and "en-GB".
  • ...
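To make the shape of these qualified names concrete, here is a minimal sketch in Python (chosen only for illustration; SILE itself is written in Lua). It splits a tag such as "sr-Latn" or "pt-BR" into language, script and region parts. The names `LangTag` and `parse_tag` are hypothetical, and the parsing is deliberately simplistic: it assumes hyphen-separated tags and silently ignores any other kind of subtag.

```python
from typing import NamedTuple, Optional


class LangTag(NamedTuple):
    """Hypothetical container for the parts of a qualified language name."""
    language: str
    script: Optional[str] = None
    region: Optional[str] = None


def parse_tag(tag: str) -> LangTag:
    """Split a hyphen-separated tag into language, script and region.

    Simplified on purpose: variants, extensions and private-use
    subtags are ignored rather than validated.
    """
    parts = tag.split("-")
    language, script, region = parts[0].lower(), None, None
    for part in parts[1:]:
        if script is None and region is None and len(part) == 4 and part.isalpha():
            script = part.title()  # four letters: script, e.g. "Latn"
        elif region is None and (
            (len(part) == 2 and part.isalpha()) or (len(part) == 3 and part.isdigit())
        ):
            region = part.upper()  # two letters or three digits: region
    return LangTag(language, script, region)


print(parse_tag("sr-Latn"))
print(parse_tag("pt-BR"))
```

With such a split, hyphenation could key on language plus script while number formatting or smart quotes could key on language plus region, but that division of labour is only a suggestion.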

This also indirectly relates to #1367 and #1157.

Near duplicate of #1368.

@Omikhleia Omikhleia changed the title Using qualified language names instead of mere language code Using qualified language names instead of language code Nov 26, 2022
@alerque
Member

alerque commented Nov 29, 2022

Yes, all true. In fact I thought I had an issue for tracking this but I don't see it. I suppose this comment is what I was thinking of.

The next step is probably to figure out whether changing document.language to a region-qualified language identifier by default would be a breaking change.

@Omikhleia
Member Author

Omikhleia commented Dec 1, 2022

Breaking for the user? No, it shouldn't be.
I've been using "en-GB" and "fr-CA" for fun, with just one clever but ugly hack to the existing code.
The question now is how far we are ready to go in refactoring the internal logic of several things to avoid that really ugly hack... I can push a draft branch by the weekend, annotated with comments, if you think that might help in assessing the problem and finding a good solution.

@alerque
Member

alerque commented Jun 14, 2024

> Breaking for the user? No, it shouldn't.
> I've been using "en-GB" and "fr-CA" for the fun,

I'd love for you to be right here, but I'm having trouble visualizing it. Using fully qualified names like en_GB in a document and shimming it to work in SILE with 'simple' names is a relatively easy automatic downgrade. I'm having trouble visualizing the other way around, where a document (like almost all of them in existence right now) specifies a simple name and we need to upgrade it to a fully qualified one. In order for this not to be a breaking change, we'll need a function to cast an otherwise ambiguous language code up into the most likely fully qualified name. No? Doable, just not simple. Or am I missing something here?
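
To illustrate the asymmetry, a minimal sketch in Python (for illustration only; SILE is written in Lua, and `downgrade` is a hypothetical name, not an existing SILE function). The downgrade is a plain truncation, while the reverse direction cannot be computed from the string alone because several qualified names collapse to the same simple one.

```python
import re


def downgrade(tag: str) -> str:
    """Reduce a qualified name to its bare language code.

    Accepts either separator ("en_GB" or "en-GB") and keeps only
    the first subtag, lowercased.
    """
    return re.split(r"[-_]", tag, maxsplit=1)[0].lower()


print(downgrade("en_GB"))    # en
print(downgrade("sr-Latn"))  # sr

# Two different qualified names map to the same simple name, so the
# inverse ("upgrade") needs extra data such as a table of defaults.
print(downgrade("sr-Cyrl") == downgrade("sr-Latn"))  # True
```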

@Omikhleia
Member Author

> Using fully qualified names like en_GB

By the way, that would be en-GB if we stick to the BCP47 format, which I would recommend: it is what the Web standards mostly use, and it is a slightly different format from "locale codes" (which is what your en_GB could be). There are rules for mapping one to the other, as well as canonicalization rules, and this is actually what our existing ICU wrapper does.
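
As a rough illustration of the mapping from a locale code to BCP47-style formatting, here is a minimal Python sketch. The function name `locale_to_bcp47` is hypothetical, and this is not what the ICU wrapper does: it only swaps the separator and applies the conventional casing (lowercase language, title-case script, uppercase region). Real canonicalization also replaces deprecated subtags and handles extensions, which this sketch ignores.

```python
def locale_to_bcp47(locale: str) -> str:
    """Reformat a locale code such as "en_GB" as a BCP47-style tag.

    Assumption: any ".encoding" or "@modifier" suffix is simply
    discarded, and extension or private-use subtags are not handled.
    """
    base = locale.split(".", 1)[0].split("@", 1)[0]
    subtags = base.replace("_", "-").split("-")
    out = [subtags[0].lower()]
    for sub in subtags[1:]:
        if len(sub) == 4 and sub.isalpha():
            out.append(sub.title())  # script, e.g. "Latn"
        elif (len(sub) == 2 and sub.isalpha()) or (len(sub) == 3 and sub.isdigit()):
            out.append(sub.upper())  # region, e.g. "GB" or "419"
        else:
            out.append(sub.lower())  # anything else, e.g. a variant
    return "-".join(out)


print(locale_to_bcp47("en_GB"))        # en-GB
print(locale_to_bcp47("pt_BR.UTF-8"))  # pt-BR
```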

> I'm having trouble visualizing the other way around where a document (like almost all of them in existence right now) specify a simple name and we need to upgrade it to a fully qualified one.

But we don't necessarily have to upgrade documents: en is valid BCP47 for "Standard English". One only needs to upgrade to, say, en-US, en-GB or en-CA in order to enable features specific to those variants (if any exist); the bare 2-letter code remains valid (and is usually taken to mean en-US).

In most cases, the 2-letter code is the canonical form of the "main language", e.g. fr is French for France (hence fr-FR does not really exist per se, but fr-CH, fr-CA etc. do have a meaning); es is always understood as es-ES (Castilian) and only needs extra qualification when referring to a variant such as es-MX (Spanish from Mexico).

In other words, it seems to me that the crux of the matter is not to enforce fully qualified names (you wouldn't want to require the very qualified but cumbersome en-Latn-US for standard English in Latin script) but just to support them, with fallback to the shortest supported form (which is what my WIP PR #1641 did).
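
To give an idea of what such a fallback could look like, here is a minimal Python sketch. It is not taken from PR #1641: the names `fallback_chain` and `resolve` are hypothetical, and the strategy shown (try the tag as given, then drop trailing subtags one at a time until something supported is found) is just one plausible reading of the idea.

```python
from typing import Iterable, List, Optional


def fallback_chain(tag: str) -> List[str]:
    """List the tag followed by its progressively shorter prefixes."""
    subtags = tag.split("-")
    return ["-".join(subtags[:n]) for n in range(len(subtags), 0, -1)]


def resolve(tag: str, supported: Iterable[str]) -> Optional[str]:
    """Return the first entry of the fallback chain that is supported."""
    supported = set(supported)
    for candidate in fallback_chain(tag):
        if candidate in supported:
            return candidate
    return None  # nothing matches, not even the bare language code


print(fallback_chain("en-Latn-US"))      # ['en-Latn-US', 'en-Latn', 'en']
print(resolve("fr-CA", {"fr", "en"}))    # fr
```

Note that pure truncation never reaches en-US from en-Latn-US, since it only drops subtags from the end; handling that would need script-aware matching.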

There are only a few cases where the non-qualified name is ambiguous (sr could be sr-Latn or sr-Cyrl) but there is usually a default interpretation.

Or did I misunderstand your question?

@Omikhleia
Member Author

Omikhleia commented Jun 15, 2024

And I stand corrected:

[image: screenshot from the CSL locales]

Sometimes we may, conversely, have to map a 2-letter language code such as "pt" to something "more qualified", simply because the files we need to load require it.

(For the curious, this screenshot is from the CSL locales, which use BCP47 but with even more explicit qualification.)
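
A minimal Python sketch of that reverse mapping, under loud assumptions: the names `qualify` and `DEFAULT_QUALIFIED` are hypothetical, the default table is illustrative only (the "pt" entry in particular is a guess, not something stated in this thread), and the set of available names stands in for whatever resource files actually exist.

```python
from typing import Iterable, Optional

# Illustrative defaults only; which variant a bare code should map to
# is a policy decision this sketch does not settle.
DEFAULT_QUALIFIED = {"pt": "pt-PT", "es": "es-ES"}


def qualify(tag: str, available: Iterable[str]) -> Optional[str]:
    """Pick a more qualified name for `tag` among the available ones."""
    available = set(available)
    if tag in available:
        return tag  # already usable as given
    preferred = DEFAULT_QUALIFIED.get(tag)
    if preferred in available:
        return preferred  # explicit default wins
    # Otherwise accept a more qualified name only when it is unambiguous.
    candidates = sorted(name for name in available if name.startswith(tag + "-"))
    return candidates[0] if len(candidates) == 1 else None


print(qualify("pt", {"pt-PT", "pt-BR"}))  # pt-PT
print(qualify("zh", {"zh-CN", "zh-TW"}))  # None
```

Returning None for an ambiguous code without a default leaves the caller free to warn or to fall back to some other behaviour.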

@alerque alerque added this to the v0.15.x milestone Jun 27, 2024
Projects
Status: In Progress
Development

No branches or pull requests

2 participants