Extra spacing after emoji variants #1978

Closed
paul opened this issue Sep 18, 2019 · 33 comments

Comments

@paul

paul commented Sep 18, 2019

I ran across this with the "rainbow flag" emoji, which in hex is U+1F3F3 U+FE0F U+200D U+1F308, i.e. "waving white flag", "variation selector-16", "zero-width joiner" and "rainbow". However I input it, whether by copy/pasting it or by using kitty's ctrl-shift-u Unicode input, it always renders with extra space afterwards:

image

If I try to print several of them in a row, it prints only a few of them, with really wide spacing. If I add an ascii space between each, then it prints all of them, and closer together.

image

  • kitty 0.14.4
  • noto color emoji font
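
For reference, the sequence can be broken into its code points with a few lines of Python. This is only an illustrative sketch; the `describe` helper and the `RAINBOW_FLAG` constant are names made up for this example and have nothing to do with kitty's code.

```python
import unicodedata

# Rainbow flag: white flag + VS16 + zero-width joiner + rainbow.
RAINBOW_FLAG = "\U0001F3F3\uFE0F\u200D\U0001F308"


def describe(text):
    """Return (code point, Unicode name) pairs for every character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]


for code_point, name in describe(RAINBOW_FLAG):
    print(code_point, name)
```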
@kovidgoyal
Owner

First off, don't use them in a shell. Shells are full of Unicode handling bugs. Instead run cat and see if the issue reproduces there; if so, let me know and I will take a look.

@paul
Author

paul commented Sep 19, 2019

image

I copied the unicode sequence from here: https://emojipedia.org/rainbow-flag/

First cat is just that emoji in a text file ten times.
Second is the emoji + space ten times.
Third is me pressing Ctrl-Shift-V with that emoji 10 times.
(Fourth is me attempting to Ctrl-Shift-V then Space 10 times, getting an extra space, hitting backspace and pasting again, but that seemed to screw everything up, so I aborted and tried again.)
Fifth is me pressing Ctrl-Shift-V then Space in sequence 10 times.

nospace.txt:

🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈

with-space.txt:

🏳️‍🌈 🏳️‍🌈 🏳️‍🌈 🏳️‍🌈 🏳️‍🌈 🏳️‍🌈 🏳️‍🌈 🏳️‍🌈 🏳️‍🌈 🏳️‍🌈 
$ xxd nospace.txt
00000000: f09f 8fb3 efb8 8fe2 808d f09f 8c88 f09f  ................
00000010: 8fb3 efb8 8fe2 808d f09f 8c88 f09f 8fb3  ................
00000020: efb8 8fe2 808d f09f 8c88 f09f 8fb3 efb8  ................
00000030: 8fe2 808d f09f 8c88 f09f 8fb3 efb8 8fe2  ................
00000040: 808d f09f 8c88 f09f 8fb3 efb8 8fe2 808d  ................
00000050: f09f 8c88 f09f 8fb3 efb8 8fe2 808d f09f  ................
00000060: 8c88 f09f 8fb3 efb8 8fe2 808d f09f 8c88  ................
00000070: f09f 8fb3 efb8 8fe2 808d f09f 8c88 f09f  ................
00000080: 8fb3 efb8 8fe2 808d f09f 8c88            ............
$ xxd with-space.txt
00000000: f09f 8fb3 efb8 8fe2 808d f09f 8c88 20f0  .............. .
00000010: 9f8f b3ef b88f e280 8df0 9f8c 8820 f09f  ............. ..
00000020: 8fb3 efb8 8fe2 808d f09f 8c88 20f0 9f8f  ............ ...
00000030: b3ef b88f e280 8df0 9f8c 8820 f09f 8fb3  ........... ....
00000040: efb8 8fe2 808d f09f 8c88 20f0 9f8f b3ef  .......... .....
00000050: b88f e280 8df0 9f8c 8820 f09f 8fb3 efb8  ......... ......
00000060: 8fe2 808d f09f 8c88 20f0 9f8f b3ef b88f  ........ .......
00000070: e280 8df0 9f8c 8820 f09f 8fb3 efb8 8fe2  ....... ........
00000080: 808d f09f 8c88 20f0 9f8f b3ef b88f e280  ...... .........
00000090: 8df0 9f8c 8820                           .....
$ cat nospace.txt| ruby -e 'STDIN.read.split(//).map {|c| printf("U+%04x ", c.ord) }'
U+1f3f3 U+fe0f U+200d U+1f308 U+1f3f3 U+fe0f U+200d U+1f308 U+1f3f3 U+fe0f U+200d U+1f308 U+1f3f3 U+fe0f U+200d U+1f308 U+1f3f3 U+fe0f U+200d U+1f308 U+1f3f3 U+fe0f U+200d U+1f308 U+1f3f3 U+fe0f U+200d U+1f308 U+1f3f3 U+fe0f U+200d U+1f308 U+1f3f3 U+fe0f U+200d U+1f308 U+1f3f3 U+fe0f U+200d U+1f308
$ cat with-space.txt| ruby -e 'STDIN.read.split(//).map {|c| printf("U+%04x ", c.ord) }'
U+1f3f3 U+fe0f U+200d U+1f308 U+0020 U+1f3f3 U+fe0f U+200d U+1f308 U+0020 U+1f3f3 U+fe0f U+200d U+1f308 U+0020 U+1f3f3 U+fe0f U+200d U+1f308 U+0020 U+1f3f3 U+fe0f U+200d U+1f308 U+0020 U+1f3f3 U+fe0fU+200d U+1f308 U+0020 U+1f3f3 U+fe0f U+200d U+1f308 U+0020 U+1f3f3 U+fe0f U+200d U+1f308 U+0020 U+1f3f3 U+fe0f U+200d U+1f308 U+0020 U+1f3f3 U+fe0f U+200d U+1f308 U+0020 
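
The same check can be sketched in Python by decoding the first 14 bytes of the xxd dump of nospace.txt. The `code_points` helper is a hypothetical name used only here.

```python
# First 14 bytes of nospace.txt, copied from the xxd output above.
FIRST_EMOJI_BYTES = bytes.fromhex("f09f8fb3 efb88f e2808d f09f8c88")


def code_points(data):
    """Decode UTF-8 bytes and format each character as U+xxxx."""
    return [f"U+{ord(ch):04x}" for ch in data.decode("utf-8")]


print(" ".join(code_points(FIRST_EMOJI_BYTES)))
```

Each emoji therefore takes 14 bytes (4 + 3 + 3 + 4), which matches the 140-byte length of the ten-emoji file.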

@ctrlcctrlv
Contributor

I think that the emoji are making this problem seem harder than it is.

It is at its core an error in ligature handling. Consider Nimbus Mono PS, or my font TT2020 Style B (which is how I first knew about this problem):

2020-01-07-143215_690x360_scrot

(Note: This will only work for Nimbus Mono PS post–d250555, because I removed the specific processing for it now that we have font_features to do it.)

Kitty renders ligatures across multiple cells according to the wcswidth of their components and not the wcswidth of their completed form; that's the problem. No rainbow flags necessary.

@kovidgoyal
Owner

There is no wcswidth of the completed form, since in general the
completed form is not a unicode codepoint but an arbitrary glyph. Also
if a terminal application prints fl to the screen it expects it to take
two cells; if it does not, it will break many applications. So I don't
really see how this can be fixed. fl and ligatures in general must
always take the number of cells indicated by the wcswidth() of the
underlying string.

@ctrlcctrlv
Contributor

ctrlcctrlv commented Jan 7, 2020

Indeed, I agree fully. My "fix" would be just better spacing—the extra space wouldn't look so bad if it was just factored into the whole word.

We already have the ability to disable ligatures under the cursor (disable_ligatures cursor). So we could repurpose that for spacing: if the rendered width of the ligature X is != wcswidth(X) cells, break the ligature on hover.

Thoughts?

@kovidgoyal
Owner

I don't think that's possible. In kitty, character images per cell (or
multiple cells for a ligature) are stored on the GPU and rendered
directly for each cell. You can't distribute spacing around a word unless
you make the entire word a ligature, and you can't do that in the general
case because kitty has a max ligature size and words could be longer,
not to mention that long ligatures are not good for performance.

@ctrlcctrlv
Contributor

Could not an extra step be added right at the end, to look for badly spaced ligatures, and fix them?

@kovidgoyal
Owner

There is no end. Rendering works by mapping each cell to a number. That
number acts as an index into a sprite map which is uploaded to the GPU.
The list of numbers is sent to the GPU and the GPU renders them in a
single pass. In the case of ligatures, the entire ligature is rendered,
then split into cells and the cells added to the sprite map.
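
A rough Python sketch of the scheme described above, for readers who want something concrete. It is not kitty's code: `SpriteMap`, `split_into_cells` and `cells_for_ligature` are hypothetical names, a "bitmap" is just a list of pixel rows, the actual text rendering is faked, and every character of the ligature is assumed to be one cell wide.

```python
class SpriteMap:
    """Toy sprite map: every distinct cell image gets a small integer index."""

    def __init__(self):
        self.sprites = []   # index -> cell bitmap (what would be uploaded to the GPU)
        self.lookup = {}    # cache key -> index

    def index_for(self, key, bitmap):
        if key not in self.lookup:
            self.lookup[key] = len(self.sprites)
            self.sprites.append(bitmap)
        return self.lookup[key]


def split_into_cells(bitmap, cell_width, num_cells):
    """Cut one wide bitmap (a list of pixel rows) into per-cell bitmaps."""
    return [
        [row[i * cell_width:(i + 1) * cell_width] for row in bitmap]
        for i in range(num_cells)
    ]


def cells_for_ligature(sprite_map, text, bitmap, cell_width):
    """Split an already rendered ligature and return one sprite index per cell."""
    pieces = split_into_cells(bitmap, cell_width, len(text))
    return [sprite_map.index_for((text, i), piece) for i, piece in enumerate(pieces)]


# A fake 2-row, 6-pixel-wide rendering of the "fl" ligature with 3-pixel cells.
fake_fl = [[1, 1, 1, 0, 1, 0],
           [0, 1, 0, 0, 1, 1]]
sprite_map = SpriteMap()
indices = cells_for_ligature(sprite_map, "fl", fake_fl, cell_width=3)
print(indices)  # the per-cell numbers that would be sent to the GPU
```

Because each cell only carries an index, there is no later stage at which a whole word could be re-spaced.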

@ctrlcctrlv
Contributor

Idea for you @kovidgoyal, and I'm sorry, it might be a stupid one. Maybe so stupid you don't even want to reply, which I will understand and not hold against you.

But in case you haven't thought of this already:

The OSWindow contains one or more Windows, which are made of a Screen, which contains cells. Each cell is literally a quad built out of an array of OpenGL vertices, known internally as vao, accessed by vao_idx.

Let's say we have the word affiliate, and the following sprites: «a f_f_i i l t e».

Laid out it would look something like:

0123456789ABCD
affi  liate mark

We know that when we rendered we got back a bitmap of a length less than the number of cells of its components. OK, we're not allowed to render the word all in one pass as a big sprite, due to performance cost. (I'm of the view this, along with making the max ligature length unlimited, could be made optional, contained inside a single option, but really it's a big hack and I won't be proposing a PR to do it, even as a non-default option recommended against.)

But why can't we bump the vertices around to make things look a little nicer when the cursor is not over the text? (In formal terms, assuming a quad of four vertices Q0…Q3, add a bump factor b to the x axis of all four vertices.)

As I see it there are two possible ways to bump the vertices: either you bump them to remove the space where it is (“bump style A”), and the bump is always expressible in the form b × cell_width, or you bump cells according to a formula (“bump style B”).

Assuming bump style A, the layout appears as:

0123456789ABCD
affiliate   mark

Cells 0 and 1 go unbumped, cells 2 and 3 are bumped to the end, every other cell until the end of the word is bumped backwards one cell.

Bump style B is not expressible in monospace terms, because it works by breaking the grid; cells are bumped in fractions of a cell. It assumes bump style A has already been applied.

So, the absolute value of strlen("aﬃliate") (X = 7, counting the ﬃ ligature as a single glyph) - strlen("affiliate") (Y = 9) is 2 (S). The bump factor, in number of cells, is easily expressed as S÷2=1.

So all cells which were not bumped to the end need to be bumped one cell:

0123456789ABCD
 affiliate  mark

We can, and ought to I think, also include cell 9 in our calculation, after all, it is visible space. So, really, S=3 and we need to bump each cell one and one half cells.
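
The arithmetic above, as a small Python sketch. `bump_factor` and its parameter names are invented for this illustration.

```python
def bump_factor(glyph_cells, text_cells, trailing_space_cells=0):
    """Cells to shift a word so that its slack is split evenly on both sides."""
    slack = abs(glyph_cells - text_cells) + trailing_space_cells  # S
    return slack / 2


print(bump_factor(7, 9))      # S = 2
print(bump_factor(7, 9, 1))   # S = 3, also counting the space in cell 9
```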

image

Is this overkill and am I overthinking it? Could very well be. But if cells can be nudged around, I don't see why it can't work.

@kovidgoyal
Owner

Rendering happens per cell, there is no vertex data for a cell. The
vertex locations on the screen are calculated in the shader from
instance id. See the section set cell vertex positions in the
cell_vertex.glsl shader.

You could of course have a separate array to displace cells that you
send to the GPU on every render, but that is a performance/code
complexity cost that will affect all rendering for a use case that is
not really that important. kitty is designed to render in character
grids. If I really wanted to support complex text shaping I would not
have made that design in the first place. The vast majority of monospace
fonts used in terminals implement ligatures that are the same width as
their constituent characters. Therefore, adding extra per cell data is
not worth it.
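
To make the point concrete, here is a Python sketch of deriving a cell's position purely from its instance id. It is not the contents of cell_vertex.glsl: the function name, the row-major layout and the pixel units are assumptions, and `x_offsets` stands in for the hypothetical per-cell displacement array mentioned above.

```python
def cell_quad(instance_id, columns, cell_width, cell_height, x_offsets=None):
    """Return (left, top, right, bottom) in pixels for one cell.

    The position comes purely from the instance id; x_offsets is the
    optional extra per-cell data that would have to be sent on every render.
    """
    row, col = divmod(instance_id, columns)
    dx = x_offsets[instance_id] if x_offsets is not None else 0
    left = col * cell_width + dx
    top = row * cell_height
    return (left, top, left + cell_width, top + cell_height)


print(cell_quad(5, columns=4, cell_width=10, cell_height=20))
```

Without `x_offsets` nothing but the grid dimensions is needed; with it, every cell carries an additional value.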

@ctrlcctrlv
Contributor

I think this should be closed then. Improved spacing is really the best solution to this problem.

@kovidgoyal
Owner

The general problem of ligatures, certainly. This particular issue, however, might be solved by improving wcswidth calculations for flags. I have to look into it someday when I have time.

@ctrlcctrlv
Contributor

No @kovidgoyal, not without making the flag itself have a widechar width of 0, but 🏳 on its own has a widechar width of 2.

Using the word "affiliate" is just an easier way to think about the general problem because all its components have a widechar width of 1, instead of 2-0-0-2 as in the rainbow flag ZWJ sequence.

@kovidgoyal
Owner

Yes, the idea would be for wcswidth to know about flag sequences. I don't know how feasible that is; I will have to see.

@ctrlcctrlv
Contributor

It's not feasible if wcswidth(S) must always equal sum(wcwidth(C) for C in S); changing that will certainly break applications...wcwidth itself was always a hack :-)

@kovidgoyal
Owner

wcswidth is definitely not always equal to sum(wcwidth()); if it were, emoji variation selectors would not work. See the kitty implementation of wcswidth() in screen.c. The only reason for wcswidth to exist at all is that it is not in general equal to sum(wcwidth()).
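
A minimal Python sketch of why the two can differ, using the sun character U+2600 with an emoji variation selector. This is not the code from screen.c: the tiny width table and both function names are assumptions made for the example, and a real implementation would only apply the selectors to valid variation sequences.

```python
VS15 = "\uFE0E"  # text presentation selector
VS16 = "\uFE0F"  # emoji presentation selector

# Tiny hand-written width table, an assumption made for this example only.
ASSUMED_WCWIDTH = {"\u2600": 1, VS15: 0, VS16: 0}


def naive_width(text):
    """Sum of per-code-point widths."""
    return sum(ASSUMED_WCWIDTH.get(ch, 1) for ch in text)


def selector_aware_width(text):
    """Width where a variation selector changes the width of the character before it."""
    widths = []
    for ch in text:
        if ch == VS16 and widths:
            widths[-1] = 2      # emoji presentation: two cells
        elif ch == VS15 and widths:
            widths[-1] = 1      # text presentation: one cell
        else:
            widths.append(ASSUMED_WCWIDTH.get(ch, 1))
    return sum(widths)


sun = "\u2600" + VS16
print(naive_width(sun), selector_aware_width(sun))
```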

@ctrlcctrlv
Contributor

Sorry to say, but if your internal implementation does not match glibc's (which works as I explained), the mismatch will lead to subtle rendering errors, unless Kitty somehow forces client applications to use its internal wcswidth. In any case, most programmers familiar with the glibc implementation consider the two always equal and wcswidth to be a mere convenience function, so I don't understand how this can work.

@kovidgoyal
Owner

And yet it does. Pretty much all advanced terminal applications use their own wcswidth() implementations, precisely because glibc's is a broken umm POS. glibc is not the canonical source for how to calculate widths, the unicode standard is. And kitty's wc(s)width is autogenerated from the unicode standard. Indeed using the system libc's wcwidth() is fundamentally a bad idea because it can be arbitrarily old and broken. Not to mention it can vary between systems when you ssh. Any serious terminal application needs to use a standards based implementation.

This has all been discussed before, ad nauseam; search this issue tracker for wcwidth.

@ctrlcctrlv
Contributor

Interesting :-)

Well in that case, yes, to my knowledge all emoji ZWJ sequences have a visual wcswidth of 2, but a glibc wcswidth of 4 or more.
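
The 2-versus-4 gap can be sketched in Python for the rainbow flag. The per-code-point widths are the 2-0-0-2 values quoted earlier in this thread, hard-coded as an assumption; `clusters` and `zwj_aware_width` are hypothetical helpers and much simpler than real grapheme segmentation.

```python
ZWJ = "\u200D"

# Per-code-point widths as quoted in this thread (2-0-0-2); assumed, not computed.
ASSUMED_WCWIDTH = {"\U0001F3F3": 2, "\uFE0F": 0, ZWJ: 0, "\U0001F308": 2}


def per_code_point_width(text):
    """Width as the plain sum of per-code-point widths."""
    return sum(ASSUMED_WCWIDTH.get(ch, 1) for ch in text)


def clusters(text):
    """Group code points: zero-width ones, and anything right after a ZWJ, join the current group."""
    groups = []
    for ch in text:
        attach = groups and (ASSUMED_WCWIDTH.get(ch, 1) == 0 or groups[-1][-1] == ZWJ)
        if attach:
            groups[-1] += ch
        else:
            groups.append(ch)
    return groups


def zwj_aware_width(text):
    """Count every group containing a ZWJ as one 2-cell emoji."""
    return sum(2 if ZWJ in group else per_code_point_width(group) for group in clusters(text))


flag = "\U0001F3F3\uFE0F\u200D\U0001F308"
print(per_code_point_width(flag), zwj_aware_width(flag))
```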

I'm sorry to waste your time with repeated discussions

@kovidgoyal
Owner

No worries, and yes looking into modifying wcswidth for ZWJ/flags is why this issue remains open.

@jsravn

jsravn commented Apr 14, 2020

Hi - is this the same issue I'm seeing when I do curl https://en.wttr.in/format=v2 ?

image

For some reason extra spaces are added after the sun emojis. I've tried this in a few other terminals (gnome-terminal, alacritty, st) and they all render this as expected. Also tried multiple different emoji fonts with the same result. This only seems to happen in kitty.

@trygveaa
Sponsor Contributor

@jsravn: No, there is no zero width joiner or flags in that output. The reason you're seeing extra spaces after the sun emojis is that the output actually contains extra spaces. This is probably done because most other terminals wrongly consider the sun emoji to be only one character wide, so that service has added spaces until it aligns in those terminals.

You can verify this by running this command:

echo -e "\u2600\ufe0f Sunny"

The sun emoji is the two escaped characters, and as you see, I've added a space between the emoji and the text. In kitty this renders as sun-emoji, space, text, which is the correct rendering. In all other terminals I've tried, except qterminal, you don't see the space.

If you remove the space and run echo -e "\u2600\ufe0fSunny", you can see that the S appears on top of the sun emoji in the terminals you mention.
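
A toy Python model of that overlap. It assumes the font draws the emoji two cells wide regardless of how far the terminal advances the cursor; `layout` and `overlaps` are made-up helpers, not anything from a real terminal.

```python
def layout(items, emoji_advance):
    """Return {label: (first_col, last_col)} for a line of (label, drawn_width) items.

    drawn_width is how many cells the glyph visually covers; emoji_advance is
    how far the terminal moves the cursor after a 2-cell glyph.
    """
    spans = {}
    col = 0
    for label, drawn_width in items:
        spans[label] = (col, col + drawn_width - 1)
        col += emoji_advance if drawn_width == 2 else drawn_width
    return spans


def overlaps(a, b):
    """True if two inclusive column ranges share at least one column."""
    return a[0] <= b[1] and b[0] <= a[1]


line = [("sun", 2), ("S", 1)]
narrow = layout(line, emoji_advance=1)  # terminal that treats the emoji as 1 cell
wide = layout(line, emoji_advance=2)    # terminal that treats it as 2 cells
print(overlaps(narrow["sun"], narrow["S"]), overlaps(wide["sun"], wide["S"]))
```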

@noraj

noraj commented Aug 17, 2023

I don't understand the discussion about the low-level rendering implementation, so I won't tell you how to do it, but I would just say that it's possible to implement ZWJ support correctly; examples are Visual Studio Code and Firefox.

I'm working on Unicode research. Writing code full of weird Unicode sequences in VS Code is no problem, but as soon as I need to copy stuff into a language interpreter (in the terminal with kitty) it ends up being a nightmare.

Eg. pasting 👩‍❤️‍👨

Kitty

image

Alacritty

image

Foot

image

Hyper

image

it's also the same for Konsole, Tabby, Qterminal, Darktile, Extraterm etc.

I found the reason why most TEs fail to implement ZWJ support even when they claim full Unicode support: https:/nmeum/saneterm#motivation

Mainstream terminal emulators (urxvt, xterm, alacritty, …) support a standard known as ANSI escape sequences. This standard defines several byte sequences to provide special control functions for terminal emulators. This includes control of the cursor, support for different colors, et cetera. They are often used to implement TUIs, e.g. using the ncurses library.

Many of these escape sequences operate on rows and columns and therefore require terminal emulators to be built around a character grid where individual cells can be modified. Historically, this was very useful to implement UIs on physical terminals like the VT100. Nowadays this approach feels dated and causes a variety of problems. For instance, the concept of grapheme cluster as used in Unicode is largely incompatible with fixed-size columns. For this reason, terminal emulators supporting the aforementioned escape sequences can never fully support Unicode [1].

On the other hand, a terminal emulator not supporting ANSI escape sequences can never support existing TUIs. However, the idea behind saneterm is that terminals shouldn't be used to implement TUIs anyhow, instead they should focus on line-based CLIs. By doing so, a variety of features normally implemented in CLI programs themselves (like readline-keybindings) can be implemented directly in the terminal emulator.

saneterm was the only TE that rendered graphemes with ZWJ correctly, but it is just a PoC and not really a TE usable day to day.

So in the end maybe a violent breaking change is required to take a similar approach, or this issue should be closed, as it will never be possible to implement this in an escape sequence based terminal.

@trygveaa
Sponsor Contributor

Eg. pasting 👩‍❤️‍👨

Kitty

image

This problem is not caused by the terminal emulator, but by the shell you're using. Run e.g. cat first and paste the emoji and you'll see that it's rendered correctly, but has some extra spacing after it. That extra spacing is what this issue is about.

A terminal emulator that does implement ZWJ sequences correctly and uses the correct width is foot.

I found the reason why most TEs fail to implement ZWJ support even when they claim full Unicode support: https:/nmeum/saneterm#motivation

The issue described there is not why kitty uses the wrong width for ZWJ sequences. What is described there also applies for emojis using variant selectors, and kitty supports that.

So in the end maybe a violent breaking change is required to take a similar approach, or this issue should be closed, as it will never be possible to implement this in an escape sequence based terminal.

I don't think that is required, but the terminal emulator and the terminal application have to agree on the display widths of grapheme clusters, which there currently isn't any way to do. What this issue is about (extra spacing) is possible to fix without that though, but it would make the problem worse for TUIs that don't use the correct display widths (which is most).

@noraj

noraj commented Aug 17, 2023

It's what I feared, my previous message was offtopic.

This problem is not caused by the terminal emulator, but by the shell you're using.

Do you know a shell with decent Unicode support? Even a non-POSIX one.

@trygveaa
Sponsor Contributor

Do you know a shell with decent Unicode support? Even a non-POSIX one.

I've only tried zsh, bash and briefly fish. Of those, only zsh has the problem you describe. In both bash and fish 👩‍❤️‍👨 is rendered correctly, but you may encounter some other problems. In bash you get problems when it tries to change the background color (which happens on paste) and if you move the cursor before the emoji. With fish in kitty it seems to work fine, since fish seems to use the same widths as kitty. You get problems in foot though, because that considers the emoji to be 2 characters wide rather than 6.

@noraj

noraj commented Aug 18, 2023

While this screenshot illustrates that 👩‍❤️‍👨 can be displayed in the foot terminal, and also by the fish and bash command-line shells but not by zsh...

ZJW-foot

... there is still an issue with kitty rendering ZWJ sequences (apart from the extra spacing). Kitty can display the ZWJ sequence, but as soon as you hit the spacebar or Enter key the grapheme is decomposed into its code points. The following video was recorded with bash, and there is a comparison with foot.

ZJW-kitty.webm

@christianparpart

Eg. pasting 👩‍❤️‍👨

@noraj, what you see there is what your shell is making of this paste. When you paste a byte sequence into your terminal, it is fed into your terminal application's standard input, which is ZSH in your case. And ZSH cannot handle grapheme clusters, but tries to be clever by rendering ZWJ and other unknown or unprintable (to ZSH's knowledge) Unicode codepoints in the form <U+ABCD>. You should file a bug with ZSH if you wish to get that fixed.

On the other hand, most likely they won't fix it, because most terminals themselves can't handle grapheme clusters. So far I only know one (you mentioned saneterm, which I had never heard of; I personally think the idea behind saneterm is wrong, but let's not go that route in kitty's board :-). I sadly could never test foot because it's Wayland-only.)
You could give Contour Terminal a try, which I implemented and which is a daily driver for a small user base. I explicitly took care of proper grapheme cluster support. If you find a bug, we'll fix it.

@noraj

noraj commented Aug 18, 2023

As I said in my previous message, the video was recorded with bash, not zsh.

@trygveaa
Sponsor Contributor

... there is still an issue with kitty rendering ZWJ sequences (apart from the extra spacing). Kitty can display the ZWJ sequence, but as soon as you hit the spacebar or Enter key the grapheme is decomposed into its code points. The following video was recorded with bash, and there is a comparison with foot.

No, this is one of the issues I mentioned you'll encounter with bash. Try with cat and you'll see that it doesn't happen (and looks like it doesn't happen with fish either). I don't know the exact details on bash to say why this happens, but it looks like it's because the cursor is moved more in kitty than in foot (which is what this issue is about).

because most terminals themselves can't handle grapheme clusters. So far I only know one

Nowadays, several terminal emulators more or less implement this. At least kitty, foot, wezterm, Konsole and as you mentioned contour.

@christianparpart

@noraj I apologize for not being specific. I was referring to your comment here: #1978 (comment) :)

@noraj

noraj commented Aug 18, 2023

No, this is one of the issues I mentioned you'll encounter with bash. Try with cat and you'll see that it doesn't happen (and looks like it doesn't happen with fish either). I don't know the exact details on bash to say why this happens, but it looks like it's because the cursor is moved more in kitty than in foot (which is what this issue is about).

OK @trygveaa, I apologize once more for not having understood the issue correctly.

You should file a bug with ZSH if you wish to get that fixed.

On the other hand, most likely they won't fix it, because most terminals themselves can't handle grapheme clusters.

@christianparpart They have no issue tracker, just an email address, I got an answer but I feel the person didn't understand the issue.

So far I only know one (you mentioned saneterm, which I had never heard of; I personally think the idea behind saneterm is wrong, but let's not go that route in kitty's board :-)

There are maybe more; all my examples may have been wrong because the issue was zsh and not the TEs themselves.

@kovidgoyal
Owner

Variant selectors are handled correctly in kitty. As for ZWJ, that is on my TODO list, see #3810

7 participants