experiment using unicode decomposition & regex char ranges #45


@missinglink missinglink commented Jul 18, 2023

DRAFT: this is not really intended for merging; it's a discussion point about how we might identify accents without manually enumerating them all.

The tl;dr is:

string
    .normalize('NFKD')
    .replace(COMBINING_MARKS, '')
    .normalize('NFKC')

There is a description of the method in #44 (comment) and some related discussion in #12 (comment). The ranges were lifted from another project I worked on.

I'd like to open up a chat about this method, as I think it's quite interesting. All the tests pass except for the one which enumerates a long list of characters.

Interestingly, characters like Ø don't decompose with this method. What does that mean exactly? Does it mean that converting it to a Latin capital O is correct or incorrect?

# remove accents from string
not ok 1 should be equivalent
  ---
    operator: deepEqual
    expected: |-
      'AAAAAAAAAEAACCEEEEEEEEIIIIIDNOOOOOOOOOUUUUYaaaaaaaaaeaacceeeeeeeeiiiiinooooooooouuuuyyAaAaAaCcCcCcCcDdDdEeEeEeEeEeGgGgGgGgGgHhHhIiIiIiIiIiIJijJjKkKkLlLlLlLlllMmNnNnNnnOoOoOoOEoeRrRrRrSsSsSsSsTtTtTtUuUuUuUuUuUuWwWwYyYZzZzZzsfOoUuAaIiOoUuUuUuUuUuUuUuAaAEaeOodTHthPpSsXxГгКкAaEeIiNnOoOoUuWwYyAaEeIiOoRrUuAAAEaaaeCcHhKkMmNnPpRrTtVvXxYyAEIOaeioRrUustSTBbFfGgHhJjKkMmPpQqSsVvWwXxYyAaBbDdEeEeHhIiIiMmOoQqUuXxZzss'
    actual: |-
      'AAAAAAAAÆAACCEEEEEEEEIIIIIÐNOOOOOØOOOUUUUYaaaaaaaaæaacceeeeeeeeiiiiinoooooøooouuuuyyAaAaAaCcCcCcCcDdĐđEeEeEeEeEeGgGgGgGgGgHhĦħIiIiIiIiIıIJijJjKkKkLlLlLlL·l·ŁłMmNnNnNnʼnOoOoOoŒœRrRrRrSsSsSsSsTtTtŦŧUuUuUuUuUuUuWwWwYyYZzZzZzsƒOoUuAaIiOoUuUuUuUuUuUuUuAaÆæØøðÞþPpSsXxГгКкAaEeIiNnOoOoUuWwYyAaEeIiOoRrUuAAAEaaaeCcHhKkMmNnPpRrTtVvXxYyAEIOaeioRrUustSTBbFfGgHhJjKkMmPpQqSsVvWwXxYyAaBbDdEeƐɛHhIiƗɨMmOoQqUuXxZzß'
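
For reference, most of the characters that differ between `expected` and `actual` above (Ø, Æ, Ł, ß and so on) appear to be ones that have no Unicode decomposition mapping, so NFKD leaves them as a single code point. A minimal sketch to check this for individual characters; `decomposedLength` is a hypothetical helper written for illustration, not something in this PR:

```javascript
// Count the code points a character expands to under NFKD.
// `decomposedLength` is a hypothetical helper, not part of the PR.
function decomposedLength (char) {
  return Array.from(char.normalize('NFKD')).length
}

console.log(decomposedLength('\u00C5')) // Å: 2 (A + U+030A combining ring above)
console.log(decomposedLength('\u00D8')) // Ø: 1 (no decomposition mapping)
console.log(decomposedLength('\u0141')) // Ł: 1
console.log(decomposedLength('\u00DF')) // ß: 1
```

A length of 1 means there is no combining mark to strip, which is why these characters pass through the method unchanged.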

edit: sorry about the formatting. My editor seems to have automatically applied the Standard JS style; I can revert those changes if we decide to proceed with this.

@missinglink
Author

Note that the regenerate dependency can be removed in favour of the pattern it generates, namely:

[\u0300-\u036F\u1AB0-\u1AFF\u1DC0-\u1DFF\u200D\u20D0-\u20FF\u3099\u309A\uFE00-\uFE0F\uFE20-\uFE2F]
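
As a rough, self-contained sketch of how that pattern slots into the tl;dr above without the dependency: the character class is copied verbatim from this comment, while the `g` flag and the `stripMarks` name are assumptions added for illustration.

```javascript
// Character class copied from the comment above; the `g` flag is an addition
// so that every mark is removed rather than only the first match.
const COMBINING_MARKS = /[\u0300-\u036F\u1AB0-\u1AFF\u1DC0-\u1DFF\u200D\u20D0-\u20FF\u3099\u309A\uFE00-\uFE0F\uFE20-\uFE2F]/g

// `stripMarks` is a hypothetical name used for illustration only.
function stripMarks (string) {
  return string
    .normalize('NFKD') // split base letters from their combining marks
    .replace(COMBINING_MARKS, '') // drop the marks
    .normalize('NFKC') // recompose whatever is left
}

console.log(stripMarks('cr\u00E8me br\u00FBl\u00E9e')) // 'creme brulee'
console.log(stripMarks('\u00D8')) // 'Ø' (no decomposition, so unchanged)
```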

@tyxla
Owner

tyxla commented Jul 20, 2023

Thanks for the PR, @missinglink 🙌

Interestingly, characters like Ø don't decompose with this method. What does that mean exactly? Does it mean that converting it to a Latin capital O is correct or incorrect?

Well, we intentionally added replacements for these characters, and they're a good example of why a library like this is preferable to using String.prototype.normalize() on its own.

That being said, I'd welcome a simplification of the current approach, as long as it still supports every character we currently replace.
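
One possible way to reconcile the two approaches, sketched here under assumptions rather than as a proposal for the actual implementation: let normalization handle everything that decomposes, and keep a small explicit map only for characters that don't. The `FALLBACKS` table below is a hypothetical subset whose values follow the `expected` string in the failing test, and `removeAccentsHybrid` is an illustrative name.

```javascript
// Pattern from earlier in this thread, with the `g` flag added.
const COMBINING_MARKS = /[\u0300-\u036F\u1AB0-\u1AFF\u1DC0-\u1DFF\u200D\u20D0-\u20FF\u3099\u309A\uFE00-\uFE0F\uFE20-\uFE2F]/g

// Hypothetical subset of explicit replacements for characters that have no
// Unicode decomposition; values follow the expected string in the failing test.
const FALLBACKS = {
  'Æ': 'AE', 'æ': 'ae', 'Ø': 'O', 'ø': 'o', 'Ð': 'D', 'ð': 'd',
  'Þ': 'TH', 'þ': 'th', 'Ł': 'L', 'ł': 'l', 'Œ': 'OE', 'œ': 'oe', 'ß': 'ss'
}

// Character class built from the map keys (none are regex metacharacters).
const FALLBACK_PATTERN = new RegExp('[' + Object.keys(FALLBACKS).join('') + ']', 'g')

// `removeAccentsHybrid` is an illustrative name, not the library's API.
function removeAccentsHybrid (string) {
  return string
    .normalize('NFKD')
    .replace(COMBINING_MARKS, '')
    .replace(FALLBACK_PATTERN, (char) => FALLBACKS[char])
    .normalize('NFKC')
}

console.log(removeAccentsHybrid('S\u00F8ren')) // 'Soren'
console.log(removeAccentsHybrid('\u0141\u00F3d\u017A')) // 'Lodz'
console.log(removeAccentsHybrid('stra\u00DFe')) // 'strasse'
```

The explicit map would only need to list the non-decomposing characters, which should be much shorter than enumerating every accented letter, though the full list would have to be derived from the existing tests.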
