experiment using unicode decomposition & regex char ranges #45

missinglink · 2023-07-18T18:54:56Z

DRAFT: this is not really intended for merging; it's a discussion point on how we might identify accents without manually enumerating them all.

The tl;dr is:

string
    .normalize('NFKD')
    .replace(COMBINING_MARKS, '')
    .normalize('NFKC')

There is a description of the method in #44 (comment) and some more related discussion in #12 (comment). The ranges have been lifted from another project I worked on.

I'd like to open up a chat about this method; I think it's quite interesting. All the tests pass except for the one which enumerates a long list of characters.

Interestingly characters like Ø don't decompose with this method, what does that mean exactly? does it mean that converting it to a latin capital O is correct or incorrect?
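
To make the question concrete, here is a small sketch of what "doesn't decompose" looks like in practice. The `toCodePoints` helper is a hypothetical name introduced only for this illustration; it is not part of the library.

```javascript
// Hypothetical helper (not part of this library): list the code points in a string.
const toCodePoints = (s) =>
  Array.from(s, (ch) =>
    'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0')
  );

// U+00E9 (é) has a canonical decomposition: base letter "e" plus a combining acute accent.
const decomposedE = toCodePoints('\u00E9'.normalize('NFKD'));

// U+00D8 (Ø) has no decomposition mapping in Unicode, so NFKD leaves it untouched.
const decomposedO = toCodePoints('\u00D8'.normalize('NFKD'));

console.log(decomposedE); // [ 'U+0065', 'U+0301' ]
console.log(decomposedO); // [ 'U+00D8' ]
```

In other words, Unicode treats Ø as a letter in its own right rather than as an O carrying a mark, so there is no combining character for the regex to strip.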

# remove accents from string
not ok 1 should be equivalent
  ---
    operator: deepEqual
    expected: |-
      'AAAAAAAAAEAACCEEEEEEEEIIIIIDNOOOOOOOOOUUUUYaaaaaaaaaeaacceeeeeeeeiiiiinooooooooouuuuyyAaAaAaCcCcCcCcDdDdEeEeEeEeEeGgGgGgGgGgHhHhIiIiIiIiIiIJijJjKkKkLlLlLlLlllMmNnNnNnnOoOoOoOEoeRrRrRrSsSsSsSsTtTtTtUuUuUuUuUuUuWwWwYyYZzZzZzsfOoUuAaIiOoUuUuUuUuUuUuUuAaAEaeOodTHthPpSsXxГгКкAaEeIiNnOoOoUuWwYyAaEeIiOoRrUuAAAEaaaeCcHhKkMmNnPpRrTtVvXxYyAEIOaeioRrUustSTBbFfGgHhJjKkMmPpQqSsVvWwXxYyAaBbDdEeEeHhIiIiMmOoQqUuXxZzss'
    actual: |-
      'AAAAAAAAÆAACCEEEEEEEEIIIIIÐNOOOOOØOOOUUUUYaaaaaaaaæaacceeeeeeeeiiiiinoooooøooouuuuyyAaAaAaCcCcCcCcDdĐđEeEeEeEeEeGgGgGgGgGgHhĦħIiIiIiIiIıIJijJjKkKkLlLlLlL·l·ŁłMmNnNnNnʼnOoOoOoŒœRrRrRrSsSsSsSsTtTtŦŧUuUuUuUuUuUuWwWwYyYZzZzZzsƒOoUuAaIiOoUuUuUuUuUuUuUuAaÆæØøðÞþPpSsXxГгКкAaEeIiNnOoOoUuWwYyAaEeIiOoRrUuAAAEaaaeCcHhKkMmNnPpRrTtVvXxYyAEIOaeioRrUustSTBbFfGgHhJjKkMmPpQqSsVvWwXxYyAaBbDdEeƐɛHhIiƗɨMmOoQqUuXxZzß'

edit: sorry about the formatting, my editor seems to have automatically applied the Standard JS style. I can revert those changes if we decide to proceed with it.

missinglink · 2023-07-18T19:04:40Z

Note that the regenerate dependency can be removed in favour of the pattern it generates, namely:

[\u0300-\u036F\u1AB0-\u1AFF\u1DC0-\u1DFF\u200D\u20D0-\u20FF\u3099\u309A\uFE00-\uFE0F\uFE20-\uFE2F]
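
As a rough sketch of what that could look like without the dependency, the pattern can be dropped into a regex literal with the `g` flag and used in the chain from the description. The function name `removeAccentsByDecomposition` is made up for this example and is not necessarily what the PR uses.

```javascript
// The character class quoted above, as a global regex literal.
const COMBINING_MARKS =
  /[\u0300-\u036F\u1AB0-\u1AFF\u1DC0-\u1DFF\u200D\u20D0-\u20FF\u3099\u309A\uFE00-\uFE0F\uFE20-\uFE2F]/g;

// Hypothetical name, for illustration only.
function removeAccentsByDecomposition(input) {
  return input
    .normalize('NFKD')            // split characters into base + combining marks
    .replace(COMBINING_MARKS, '') // drop the marks
    .normalize('NFKC');           // recompose whatever is left
}

console.log(removeAccentsByDecomposition('caf\u00E9')); // cafe
console.log(removeAccentsByDecomposition('\u00D8'));    // Ø (unchanged, it never decomposes)
```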

tyxla · 2023-07-20T11:59:48Z

Thanks for the PR, @missinglink 🙌

> Interestingly characters like Ø don't decompose with this method, what does that mean exactly? does it mean that converting it to a latin capital O is correct or incorrect?

Well, we intentionally added replacements for these characters, and it's a good example of why such a library is preferred to using String.normalize().

That being said, I'd welcome a simplification of the current approach that supports all current characters that we replace.
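
One possible shape for such a simplification, purely as a sketch: keep the decomposition step for everything Unicode can decompose, and fall back to a small explicit map for the characters that survive it. The map below is an assumption read off the expected and actual strings in the failing test output above, and it is deliberately partial; `FALLBACKS` and `removeAccentsHybrid` are hypothetical names, not existing code in this repository.

```javascript
const COMBINING_MARKS =
  /[\u0300-\u036F\u1AB0-\u1AFF\u1DC0-\u1DFF\u200D\u20D0-\u20FF\u3099\u309A\uFE00-\uFE0F\uFE20-\uFE2F]/g;

// Partial, illustrative map of characters that NFKD leaves intact.
// Assumption: the replacements mirror the "expected" string in the failing test.
const FALLBACKS = new Map([
  ['Æ', 'AE'], ['æ', 'ae'], ['Œ', 'OE'], ['œ', 'oe'],
  ['Ø', 'O'], ['ø', 'o'], ['Ð', 'D'], ['ð', 'd'],
  ['Đ', 'D'], ['đ', 'd'], ['Ħ', 'H'], ['ħ', 'h'],
  ['Ł', 'L'], ['ł', 'l'], ['Ŧ', 'T'], ['ŧ', 't'],
  ['Þ', 'TH'], ['þ', 'th'], ['ß', 'ss'], ['ı', 'i'],
]);

// Build one character class from the map keys (none of them are regex metacharacters).
const FALLBACK_PATTERN = new RegExp('[' + [...FALLBACKS.keys()].join('') + ']', 'g');

// Hypothetical name, for illustration only.
function removeAccentsHybrid(input) {
  return input
    .normalize('NFKD')
    .replace(COMBINING_MARKS, '')
    .normalize('NFKC')
    .replace(FALLBACK_PATTERN, (ch) => FALLBACKS.get(ch));
}

console.log(removeAccentsHybrid('Łódź'));   // Lodz
console.log(removeAccentsHybrid('Straße')); // Strasse
```

The trade-off is that the hand-maintained list shrinks to the characters Unicode does not decompose, while anything outside both the map and the combining ranges would still pass through unchanged.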

refactor: experiment using unicode decomposition+recomposition and th…

0a2af9a

…e regenerate lib for pattern matching
