Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expand semantic highlighting support for CompletionItem.Documentation and Hover.Contents #1056

Open
333fred opened this issue Jul 25, 2020 · 8 comments
Labels
completion feature-request Request for new features or functionality semantic tokens
Milestone

Comments

@333fred
Copy link

333fred commented Jul 25, 2020

Version 3.16 will add semantic highlighting to the protocol, which is a great addition that Omnisharp has already used to great effect. However, adding this to the text editor has then made all the places that use markdown for formatting today stick out, as they're not using real semantic classification and can often mess up highlighting, particularly because they're just snippets. A bare word might refer to a parameter or field or method name, for example, but as the markdown content of the documentation or hover has no extra info, it'll be highlighted as a local variable. I recently implemented some improvements for Omnisharp in the hover provider to use Roslyn's QuickInfo service, but the lack of expressiveness in markdown quickly became apparent. I think the issue is best demonstrated through a picture. Given this simple sample with some documentation comments:

namespace ClassLibrary1
{
    public class Class1
    {
        /// <summary>
        /// Gets an instance of <see cref="Class2"/>
        /// </summary>
        /// <returns>A created instance of <see cref="Class2"/></returns>
        public Class2 M(string param1, int param2)
        {
            return null;
        }
    }
    public class Class2
    {
    }
}

Here's what the hover looks like in VS over M:
image

There's several pieces of this popup that are simply unachievable through unaltered markdown:

  1. The method codicon is displayed inline with the method signature, and the signature is highlighted as C# code. You can't do this with markdown: content inside a code block is interpreted literally, so $(symbol-method) would be interpreted as that literal text and not substituted with a codicon.
  2. All the components of the signature are clickable links that go to the definition of that component. You can click on any highlighted reference to Class2 and it will go to definition on that member. This is also not possible for the same reason as # 1: the []() link syntax inside a code block would be interpreted literally.
  3. The references to Class2 in the summary and returns section are semantically colored and clickable. There are some markdown extensions that support inline colorization of single-line code blocks, but this is a) not semantic, and b) not guaranteed to be supported by the editor. You can get links working for these locations as they're single-line blocks, but you can't get both today.

To solve these, I'd like to propose an extension to MarkupContent. This is a very, very, very rough proposal: I spend most of my time in compiler land and am certain to get parts of this wrong, so criticism and ideas are welcome. This extension has, to my mind, a few specific goals:

  1. Allow for inline semantic rendering of code text.
  2. Allow for rich linking inside the content, able to execute commands based on clicking. This can be LSP commands, such as go-to-definition, but it should be extensible enough to allow for other things. URLs, for example, should be resolvable, as it's perfectly reasonable to put a web link in documentation.

My initial proposal is to add a new kind to MarkupKind to indicate that the content is semantic markdown and provide a SemanticTokens stream that should be used for colorization. In the definition below, I've elided the existing parts of MarkupContent for brevity, but assume they're unchanged except where noted.

/**
 * Describes the content type that a client supports in various
 * result literals like `Hover`, `ParameterInfo` or `CompletionItem`.
 *
 * Please note that `MarkupKinds` must not start with a `$`. This kinds
 * are reserved for internal usage.
 */
export namespace MarkupKind {
    /* (unchanged) */
    /**
     * Semantic markdown is a markdown string, with an accompanying set of semantic tokens to
     * aid in editor colorization of the string. The first token delta is relative to the start
     * of the string, and includes punctuation such as code fences.
     */
     export const SemanticMarkdown: 'semanticMarkdown' = 'semanticMarkdown';
}
export type MarkupKind = 'plaintext' | 'markdown' | ['semanticMarkdown', SemanticTokens];

/**
 * ... (unchanged)
 *
 * *Please Note* that clients might sanitize the return markdown. A client could decide to
 * remove HTML from the markdown to avoid script execution.
 *
 * If the kind is `semanticMarkdown`, the tokens it provides are used when rendering the tokens in
 * this string. The first token delta is considered from the start of `value` and semantic tokens
 * are mapped using the `SemanticTokensLegend` provided by the language server during registration.
 *
 * TODO: More examples and documentation
 * 
 */
export interface MarkupContent {
    /* (unchanged) */
}

For the previous example of hovering over M, this is a rough idea of what you'd provide. I'm going to just use type or identifier names in the url locations, but presumably they'd be actual URIs that link to commands.

$(symbol-method) [`Class2`](Class2) [`Class1`](Class1)`.`[`M`](M)`(`[`string`](string)` `[`param1`](param1)`, `[`int`](int)` `[`param2`](param2)`)`

Gets an instance of [`Class2`](Class2)

Returns:

  A created instance of [`Class2`](Class2)

The accompanying token stream would look something like this. I'm going to make the stream uncompressed for the sake of everyone reading this, and I'm sure I'll get some offset wrong, but it conveys the general idea:

[
  // Line 0
  { line: 0, startChar: 20, length: 5, tokenType: ["class"], tokenModifiers: ["public"] },            // [Class2] Class1.M(string param1, int param2)
  { line: 0, startChar: 39, length: 5, tokenType: ["class"], tokenModifiers: ["public"] },            // Class2 [Class1].M(string param1, int param2)
  { line: 0, startChar: 60, length: 1, tokenType: ["method"], tokenModifiers: ["public"] },           // Class2 Class1.[M](string param1, int param2)
  { line: 0, startChar: 71, length: 5, tokenType: ["keyword", "type"], tokenModifiers: ["public"] },  // Class2 Class1.M([string] param1, int param2)
  { line: 0, startChar: 92, length: 6, tokenType: ["parameter"], tokenModifiers: [] },                // Class2 Class1.M(string [param1], int param2)
  { line: 0, startChar: 114, length: 5, tokenType: ["keyword", "type"], tokenModifiers: ["public"] }, // Class2 Class1.M(string param1, [int] param2)
  { line: 0, startChar: 138, length: 6, tokenType: ["parameter"], tokenModifiers: [] },               // Class2 Class1.M(string param1, int [param2])

  // Line 2
  { line: 2, startChar: 23, length: 5, tokenType: ["class"], tokenModifiers: ["public"] },            // Gets an instance of [Class2]

  // Line 7
  { line: 2, startChar: 27, length: 5, tokenType: ["class"], tokenModifiers: ["public"] },            //  A created instance of [Class2]
]

As I said, this is a very rough first draft, and there are a few open questions I have:

  1. I've tried to reuse markdown as a basis for the proposal to avoid having to come up with yet another markup language for things like text formatting and linking, but is it trying to stretch markdown too far?
  2. Should things that need to be highlighted need to be wrapped in backticks? Or should the presence of a semantic classification for the item do all that a backtick implies?
    a. If yes, backticks should still be specified, should the semantic classification span include the backticks?
  3. What happens if you provide semantic tokens for things inside a triple-backtick code block? In particular, what if that code block already has a language provided?
  4. It would be very nice if language servers could provide additional information along with the semantic token to aid in handling commands. For example, in order to semantically highlight the string the server already had to look up the symbol representation, and it would be good if the information for going to that definition or performing other actions was able to be included with the token. I'm unsure how to best achieve this, however: would we want to include it in the URI link?
@dbaeumer dbaeumer added feature-request Request for new features or functionality completion semantic tokens labels Nov 11, 2020
@dbaeumer dbaeumer added this to the Backlog milestone Nov 11, 2020
@willstott101
Copy link

I understand the screenshot you've posted is for VisualStudio, could you provide an example of a similar method using a modern LSP client. This would make it easier understand the change you're proposing without having a lot of knowledge of LSPs and the way they're used.

I'd love to see LSPs/LSIF used to generate static docs, and something like this seems like an absolute requirement for that, however commenting on your use of markdown is tricky without examples of current markdown capabilities.

@dbaeumer
Copy link
Member

dbaeumer commented Jan 26, 2021

We discussed this internally as well and one idea was to have a special language in markup instead of sending the raw tokens. Something like

Some normal text

` ` `semantic
<tokenType = "class">xxx</tokenType> ....
` ` `

And then the client can turn it into semantic tokens. Would avoid having yet another MarkupKind and can be combined with other text since it is markup.

@333fred
Copy link
Author

333fred commented Jan 26, 2021

@dbaeumer would that include being able to specify the token types for non-blocks? Putting the token type in the markdown block itself is a bit concerning to me given that it means you have to start reasoning about escaping rules with "unescaped" text, but it is certainly a workable approach with some amount effort.

@dbaeumer
Copy link
Member

@333fred I am not sure I understand

would that include being able to specify the token types for non-blocks?

Can you give me an example.

What we should avoid is a format where positions are computed upfront. That makes it very hard to use if people want to type the text and still want to get semantic classification.

@333fred
Copy link
Author

333fred commented Jan 27, 2021

In my original post, I showed a screenshot from VS that included classified text inline with normal text. Basically, being able to specify tokens in a single-backtick codeblock, not just a triple-backtick codeblock.

@michaelmesser
Copy link
Contributor

I have 3 potential ideas. Not sure of any of the 3 are good.

  1. Reuse the existing semantic token format:
  • Add semanticTokensMap?: Record<string, uninteger[]> to MarkupContent
  • In markdown these could be referenced by a special semanticToken language
```semanticTokens:???
code here
```
  • The ??? would say which string to lookup in the semanticTokensMap
  • Downside: Does not work for a single line
  • Upside: Simple
  1. HTML in markdown pre with span with data-token-type and data-token-modifiers.
<pre>
  <span data-token-type="type1">Token1</span>
  <span data-token-type="type2" data-token-modifiers="modifier1,modifier2">Token2</span>
</pre>
  • Downside: HTML adds complexity
  • Upside: HTML is better at complex concepts than markdown
  1. Use/extend a some JSON based rich text format like https:/portabletext/portabletext
  • Downside: Probably too custom
  • Upside: Would give us full control

@0dinD
Copy link

0dinD commented Oct 18, 2021

  1. HTML in markdown pre with span with data-token-type and data-token-modifiers.

Now that VS Code supports rendering a subset of HTML tags in MarkdownStrings (microsoft/vscode#40607) and there is a related capability for LSP (#1344) just waiting for a PR, this seems a lot more feasible. I really like this idea because it's still valid Markdown. Clients could announce support for these semantic tokens by announcing support for span tags with the data-token-type and data-token-modifiers attributes.

The only downside I see is that it's not as space-efficient of a format as the encoded uint-array, but I don't think these kinds of code blocks are nearly as large as whole source files, so maybe that's not an issue?

Another idea if we do want to save space is to borrow the MarkupContent#semanticTokensMap?: Record<string, uinteger[]> solution from your first proposal, but express it like this:

## Highligting of a code block

<pre><code data-semantic-tokens="0">
void foo(Bar bar, String s, int i) {d
    baz(bar, s, i + 3);
}
</code></pre>

## Highlighting of inline code

<code data-semantic-tokens="1">Foo bar(Baz baz)</code>

---

Some _normal_ markdown without semantic tokens.

---

## Highlighting of another code block

<pre><code data-semantic-tokens="2">
void bar(Foo foo, Float f) {
    return false;
}
</code></pre>

where the ranges of the encoded semantic tokens you get from semanticTokensMap are relative to the contents of their corresponding <pre> block, and the keys for the map are specified as a data-semantic-tokens attribute on code tags (inline or in preformatted text blocks). Before sending code blocks like these, the server would check if the client announced support for code tags with the data-semantic-tokens attribute.


Having said this, I also think it would be really useful to be able to provide commands (as links) on spans in the code block, in order to be able to provide something like "go to symbol" on the tokens. But that's probably outside the scope of this issue, unless we add that data to these semantic tokens and consider them different from the ones provided via textDocument/semanticTokens/.

Although now that I think about it, this would already be possible in my <pre><code> solution, because both those tags permit phrasing content, i.e. you could do this:

<pre><code data-semantic-tokens="3">
void bar(<a href="file:/path/to/Foo">Foo</a> foo, Float f) {
    return false;
}
</code></pre>

assuming that the client announces support for a tags with the href attribute.

But that again raises the question of how semantic token ranges should handle the fact that the contents of the code block are not the same as what's rendered to the user. Again, a simple solution would be to encode the semantic tokens as <span data-token-type="..." data-token-modifier="..."> elements.

Another idea is to add MarkupContent#definitions?: Record<string, LocationLink[]>, which for each code block with a given ID gives a number of LocationLinks with their originSelectionRange again relative to the contents of that code block.

Considering this, maybe the attribute on the code element should be more generically named, say data-embedded-code-id instead of data-semantic-tokens?

@michaelmesser
Copy link
Contributor

I think the simplest proposal would be to just add semanticTokens?: uninteger[] to MarkupContent.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
completion feature-request Request for new features or functionality semantic tokens
Projects
None yet
Development

No branches or pull requests

5 participants