Metadata should not depend on (absolute) text spans #11304

Open
4e6 opened this issue Oct 11, 2024 · 11 comments

@4e6
Contributor

4e6 commented Oct 11, 2024

User Visible Goal

Let the user "edit source files in an external editor" without completely breaking the information persisted in the METADATA section, e.g. without losing the positions, colors, etc. of nodes in the graph, the state of widgets, and so on.

Terminology

The META-DATA section consists of two lines, complemented by on-the-fly mappings that are sent as part of TextEdit requests:

  • the first line defines UUIDs and their source locations - together with the on-the-fly mappings it is called the IdMap
  • the second line (nicknamed the "IDE line") contains meta-data (position, color, state, etc.) associated with the UUIDs defined in the first line
  • since Remove expression UUIDs from metadata section of a source file #10182 there are also on-the-fly mappings transferred as part of TextEdit requests - together with the first line of the META-DATA section they form the so-called IdMap

UUIDs are important for the language server protocol. The IDE and the engine communicate via the language server protocol and use UUIDs to identify "locations".

Since #10182 the first META-DATA line contains only the UUIDs that appear on the second line - i.e. those used by the IDE to associate some persistent information with those locations. The rest of the UUIDs are generated on the fly - each TextEdit request can contain on-the-fly additions to the first META-DATA line in the file, forming a new IdMap.

Constraints

  • as it is hard to estimate what impact changes to the UUID system would have on the language server protocol - keep UUIDs as they are now
  • as it is easy to see what impact a change of the META-DATA section format has - redesign the META-DATA section format
  • as it is possible to design a META-DATA format that is resilient to user changes - redesign the META-DATA section format so that it does not depend on absolute text spans.

That way we can satisfy the user goal without impacting the whole system and keep the change confined to the META-DATA section format.

The New METADATA Section

The metadata stored in files (by the IDE) currently relates only to nodes - i.e. to var = expr statements inside method bodies. Let's use the AST path to such an element as an anchor that identifies a semantic location in the source code, so that each IDE node can be identified by its path in the AST. Let's use the following format for the anchor: {method pointer}.{variable name}. For example, in

main =
    op1 = expr1

the op1 node can be identified as main.op1.
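
To make the anchor idea concrete, here is a minimal sketch of deriving {method}.{variable} anchors from source text. It is an illustration only: the `anchors` function and the `BINDING` regex are hypothetical, and a line-based scan stands in for the real parser (it assumes top-level bindings are methods and every indented binding belongs to the nearest method above it).

```python
import re

# Hypothetical stand-in for a real parser: matches "name =" at the start of a
# line (after optional indentation), but not "name ==".
BINDING = re.compile(r"^(\s*)([a-z_][A-Za-z0-9_]*)\s*=(?!=)")


def anchors(source: str) -> dict:
    """Map '{method}.{variable}' anchors to the 0-based line of the binding."""
    result = {}
    method = None
    for line_no, line in enumerate(source.splitlines()):
        m = BINDING.match(line)
        if m is None:
            continue
        indent, name = m.group(1), m.group(2)
        if indent == "":
            method = name  # unindented binding: treated as a method definition
        elif method is not None:
            result[f"{method}.{name}"] = line_no
    return result


before = "main =\n    op1 = expr1\n"
# The same module after a comment was added in an external editor.
after = "# a comment added in an external editor\n\nmain =\n    op1 = expr1\n"

print(anchors(before))
print(anchors(after))
```

The anchor name `main.op1` is the same in both versions even though the line it points to has moved, which is the property the proposal relies on.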

Local Text Spans

In addition to the AST-based anchor described above, we need a way to specify an exact location (just like the current system of absolute text spans does). To do so, let's support exact identification by relative text spans. That is:

  • instead of [Span, UUID] pairs (a mapping from an absolute span, measured from the beginning of the file, to a UUID),
  • let's use [AST Path, Span, UUID] triples

The AST Path provides an offset-neutral location inside the AST - an anchor resilient to user edits (all but the removal or rename of a method or its variable). The local text span then fine-tunes the location to any expression or element inside the nearest AST anchor.
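
A rough sketch of how such a triple could be resolved back to an absolute span. Everything here is an assumption made for illustration: the `Entry` type, the `anchor_offset` and `resolve` helpers, the made-up UUID, and the choice to measure the relative span from the first character of the binding. A line-based scan again stands in for the real parser.

```python
import re
from typing import NamedTuple, Optional, Tuple

BINDING = re.compile(r"^(\s*)([a-z_][A-Za-z0-9_]*)\s*=(?!=)")


class Entry(NamedTuple):
    path: str    # AST path, "{method}.{variable}"
    start: int   # span start, relative to the first character of the binding
    length: int  # span length in characters
    uuid: str


def anchor_offset(source: str, path: str) -> Optional[int]:
    """Absolute offset of the binding named by path, or None if it is gone."""
    method, _, variable = path.rpartition(".")
    current = None
    offset = 0
    for line in source.splitlines(keepends=True):
        m = BINDING.match(line)
        if m:
            if m.group(1) == "":
                current = m.group(2)  # unindented binding: a method
            elif current == method and m.group(2) == variable:
                return offset + len(m.group(1))
        offset += len(line)
    return None


def resolve(source: str, entry: Entry) -> Optional[Tuple[int, int]]:
    """Turn a relative span into an absolute (start, end) span."""
    base = anchor_offset(source, entry.path)
    if base is None:
        return None
    return (base + entry.start, base + entry.start + entry.length)


original = "main =\n    op1 = foo bar\n"
edited = "from Standard.Base import all\n\n" + original

# Points at "foo", 6 characters after the start of "op1 = foo bar".
entry = Entry("main.op1", 6, 3, "00000000-0000-0000-0000-000000000001")

for text in (original, edited):
    span = resolve(text, entry)
    print(span, text[span[0]:span[1]])
```

The absolute span differs between the two versions, but the stored triple is unchanged and still selects the same expression. If the anchor is renamed or removed, `resolve` returns `None`, matching the stated limitation.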

Will it Work?

  • the IDE & language server continue to use UUIDs in their protocol as usual
  • the new META-DATA section can specify all the locations the current system can specify
  • the [AST Path, Relative Span, UUID] triple remains stable under all edits that do not touch the anchor itself or the content up to the next anchor
  • the new META-DATA section format will be versioned - once it is found insufficient (for example because of patterns being defined on the LHS) we design a new format

Yes, it is going to work.

@4e6 4e6 added the -parser label Oct 11, 2024
@kazcw
Contributor

kazcw commented Oct 11, 2024

I like the idea of having symbolic source-code references, but there are a lot of syntactic cases that will each need their own solution.

The proposed path type ({method pointer}.{variable name}) can identify a binding, but we have plans to support attaching metadata to any subexpression (e.g. widget picker: #8754). Even today, not every component shown in the graph has a unique binding; a component can be:

  • a method's return expression
  • an expression-statement without a binding, other than a return expression
  • a method argument definition

Another thing to consider is that the LHS of a binding is not strictly an identifier; it is a pattern. I don't think the backend currently supports destructuring-bindings at all, but once that is implemented there won't always be a simple way to stringify the LHS of a binding.


If the goal is storing metadata in a way that it is resilient to external edits, there's a simpler way: We can use the module source code as a map of itself. Include a snapshot of the module alongside the serialized metadata; then to load a module from disk:

  • Parse the snapshot to an AST; attach the metadata.
  • Use Ast.syncToCode to update the parse tree to the current source code.

This way we would preserve all metadata, anywhere in the AST.

@farmaazon
Contributor

farmaazon commented Oct 14, 2024

Another thing to consider is that the LHS of a binding is not strictly an identifier; it is a pattern. I don't think the backend currently supports destructuring-bindings at all, but once that is implemented there won't always be a simple way to stringify the LHS of a binding.

I think this is not much of a problem: the key of a given metadata may be just the entire binding - assuming every binding must introduce a variable, they would have to be unique anyway.

As for subexpressions: I think we could just design a "breadcrumb" identification of widgets inside an existing node, which could be even a bit smarter than AST crumbs.

The only real problem I see here is the "bindingless" nodes - but here we could make our graph require giving them a name when trying to assign any metadata (like position).

If the goal is storing metadata in a way that it is resilient to external edits, there's a simpler way: We can use the module source code as a map of itself. Include a snapshot of the module alongside the serialized metadata; then to load a module from disk:

  • Parse the snapshot to an AST; attach the metadata.
  • Use Ast.syncToCode to update the parse tree to the current source code.

This way we would preserve all metadata, anywhere in the AST.

How is Ast.syncToCode resilient to reordering lines inside the definition? This is one of the advantages of storing metadata "by binding".

@kazcw
Contributor

kazcw commented Oct 14, 2024

If the goal is storing metadata in a way that it is resilient to external edits, there's a simpler way: We can use the module source code as a map of itself. Include a snapshot of the module alongside the serialized metadata; then to load a module from disk:

  • Parse the snapshot to an AST; attach the metadata.
  • Use Ast.syncToCode to update the parse tree to the current source code.

This way we would preserve all metadata, anywhere in the AST.

How is Ast.syncToCode resilient to reordering lines inside the definition? This is one of the advantages of storing metadata "by binding".

Currently it tracks reordered lines, but not lines that are both reordered and mutated:

// Movement matching: For each new tree that hasn't been matched, match it with any identical unmatched old tree.
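
As a simplified illustration of that step (not the actual syncToCode implementation, which works on ASTs), the sketch below treats each line as a "tree" and carries metadata across a reorder. The `carry_metadata` function and the sample metadata are hypothetical.

```python
def carry_metadata(old_lines, old_meta, new_lines):
    """Carry per-line metadata from old_lines over to new_lines.

    old_meta maps an old line index to its metadata; the result maps new
    line indices to the metadata that could be carried over.
    """
    new_meta = {}
    unmatched_old = set(range(len(old_lines)))
    matched_new = set()

    # Pass 1: lines that are unchanged and still at the same index.
    for i, line in enumerate(new_lines):
        if i in unmatched_old and old_lines[i] == line:
            unmatched_old.discard(i)
            matched_new.add(i)
            if i in old_meta:
                new_meta[i] = old_meta[i]

    # Pass 2 (movement matching): pair each still-unmatched new line with
    # any identical old line that has not been claimed yet.
    for i, line in enumerate(new_lines):
        if i in matched_new:
            continue
        for j in sorted(unmatched_old):
            if old_lines[j] == line:
                unmatched_old.discard(j)
                matched_new.add(i)
                if j in old_meta:
                    new_meta[i] = old_meta[j]
                break
    return new_meta


old_lines = ["a = 1", "b = 2", "c = 3"]
old_meta = {0: {"x": 0}, 1: {"x": 100}, 2: {"x": 200}}
# "c" was moved to the top and also edited; "a" and "b" only moved.
new_lines = ["c = 30", "a = 1", "b = 2"]

print(carry_metadata(old_lines, old_meta, new_lines))
```

In this run the metadata of `a` and `b` follows them to their new positions, while the line that was both moved and changed finds no identical partner and loses its metadata, which is the limitation described above.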

It would be straightforward to add binding-aware block comparison in order to handle reordered, mutated lines--easier I think than defining an addressing scheme that can identify any of the syntactic constructs we render as components, and any of their subexpression ASTs.
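
One possible shape of such a binding-aware comparison, sketched on plain lines rather than ASTs. The `binding_name` and `match_lines` helpers are hypothetical, and the regex assumes the LHS is a simple identifier (so destructuring patterns are out of scope here).

```python
import re


def binding_name(line):
    """Name bound by a 'name = expr' line, or None for other lines."""
    m = re.match(r"\s*([a-z_][A-Za-z0-9_]*)\s*=(?!=)", line)
    return m.group(1) if m else None


def match_lines(old_lines, new_lines):
    """Return a dict mapping each matched new line index to an old index."""
    pairs = {}
    unmatched_old = list(range(len(old_lines)))

    # Pass 1: identical text, regardless of position.
    for new_i, line in enumerate(new_lines):
        for old_i in unmatched_old:
            if old_lines[old_i] == line:
                pairs[new_i] = old_i
                unmatched_old.remove(old_i)
                break

    # Pass 2 (binding-aware): a changed line still matches the old line
    # that bound the same variable name.
    for new_i, line in enumerate(new_lines):
        if new_i in pairs:
            continue
        name = binding_name(line)
        if name is None:
            continue
        for old_i in unmatched_old:
            if binding_name(old_lines[old_i]) == name:
                pairs[new_i] = old_i
                unmatched_old.remove(old_i)
                break
    return pairs


old_block = ["a = 1", "b = 2", "c = 3"]
new_block = ["c = 30", "a = 1", "b = 2"]  # "c" moved and mutated

print(match_lines(old_block, new_block))
```

Here the moved and edited `c` line is paired with the old `c` line through its binding name, so metadata attached to it could be kept. Lines without a binding still rely on exact matching in this sketch.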

@farmaazon
Contributor

Well, I think it sounds quite good to me. I would only make sure the code snapshot is "encrypted" for the user, so they won't edit the snapshot instead of the code by accident. Something sort of "compress + base64".
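
A minimal sketch of what "compress + base64" could look like, using only the Python standard library. The `armor` and `unarmor` names are hypothetical, and nothing here reflects a decided file format.

```python
import base64
import zlib


def armor(source: str) -> str:
    """Compress the module source and encode it as one line of ASCII."""
    compressed = zlib.compress(source.encode("utf-8"))
    return base64.b64encode(compressed).decode("ascii")


def unarmor(blob: str) -> str:
    """Recover the original module source from an armored snapshot."""
    return zlib.decompress(base64.b64decode(blob)).decode("utf-8")


snapshot_source = "main =\n    op1 = expr1\n"
blob = armor(snapshot_source)

print(blob)
print(unarmor(blob) == snapshot_source)
```

The armored form is a single line with no readable code in it, so it is unlikely to be edited by accident. Note that for very small modules the armored text can be longer than the source, since compression overhead and base64 expansion dominate.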

@JaroslavTulach
Member

LHS of a binding is not strictly an identifier; it is a pattern. I don't think the backend currently supports destructuring-bindings at all

An essential part of the new meta-data format is the identification of its version. It doesn't matter that the format isn't good enough for future evolution of the language/engine. Once it is found insufficient, we will define a new format and change its version identification.

edit source files externally without (totally) breaking the METADATA section

We are looking for a fast solution that allows users to edit .enso files in an external editor and load the files back into the IDE without total layout reset.

@JaroslavTulach JaroslavTulach changed the title Metadata should not depend on text spans Metadata should not depend on (absolute) text spans Oct 16, 2024
@JaroslavTulach
Member

  • Parse the snapshot to an AST; attach the metadata.

@kazcw explained to me:

Include a snapshot of the module alongside the serialized metadata

What is a snapshot of a module?

The idea is that the .enso file would include the source code twice: Once in plain text, externally-editable, and once "armored" (compress and base64 or the like). Then we will always have the IDE's last state to compare to the possibly-externally-edited source.

I see. Such a duplication goes against the attempt to make META-DATA section smaller. Making the enormous meta-data smaller was a huge driver behind

We want to make sure the meta-data section is even smaller than right now (ideas described in #7989), not doubling the size of the user code.

@farmaazon
Contributor

  • Parse the snapshot to an AST; attach the metadata.

@kazcw explained to me:

Include a snapshot of the module alongside the serialized metadata

What is a snapshot of a module?

The idea is that the .enso file would include the source code twice: Once in plain text, externally-editable, and once "armored" (compress and base64 or the like). Then we will always have the IDE's last state to compare to the possibly-externally-edited source.

I see. Such a duplication goes against the attempt to make META-DATA section smaller. Making the enormous meta-data smaller was a huge driver behind

We want to make sure the meta-data section is even smaller than right now (ideas described in #7989), not doubling the size of the user code.

I think a compressed snapshot won't take as much.

Also, the doubling of source code is ok for me - code files aren't particularly big after all. And, in files where every node has metadata attached (position, visualization, color...) it will be hard not to double the code size, actually.

The problem we had was not that the metadata section doubled the size, but that it increased it by two orders of magnitude.

@jdunkerley
Member

  • Parse the snapshot to an AST; attach the metadata.

@kazcw explained to me:

Include a snapshot of the module alongside the serialized metadata

What is a snapshot of a module?

The idea is that the .enso file would include the source code twice: Once in plain text, externally-editable, and once "armored" (compress and base64 or the like). Then we will always have the IDE's last state to compare to the possibly-externally-edited source.

I see. Such a duplication goes against the attempt to make META-DATA section smaller. Making the enormous meta-data smaller was a huge driver behind

We want to make sure the meta-data section is even smaller than right now (ideas described in #7989), not doubling the size of the user code.

I think a compressed snapshot won't take as much.

Also, the doubling of source code is ok for me - code files aren't particularly big after all. And, in files where every node has metadata attached (position, visualization, color...) it will be hard not to double the code size, actually.

The problem we had was not that the metadata section doubled the size, but that it increased it by two orders of magnitude.

Agree - I'm not worried about making the metadata smaller than it is now. The previous version where it would be multiple kb for a small file was the problem.

The most important goal of this change is to make it more resilient to external edits (we want to enable changing descriptions or editing in VS Code without losing all metadata). Adding versioning should also allow us to evolve it going forward which would be a great win.

If we end up with something where a user could rename a variable in a text editor with find and replace, this would be fantastic. The original suggestion of referring to {method}.{variable}#offset or similar could easily allow this.

@kazcw
Contributor

kazcw commented Oct 16, 2024

If we end up with something where a user could rename a variable in a text editor with find and replace, this would be fantastic. The original suggestion of referring to {method}.{variable}#offset or similar could easily allow this.

The syncToCode algorithm handles this too. I designed it not just for the code editor but so that we could correctly handle any external edits that occur while the IDE is running. My proposal of saving source code "snapshots" would extend usage of the algorithm we're already using for this purpose to the case where changes happen when the IDE is closed.

@jdunkerley
Member

jdunkerley commented Oct 16, 2024

This feels like a much larger piece of work than the original suggestion.
Ideally this would be delivered in this or the next sprint.

@kazcw how long would it take to implement this kind of approach (bearing in mind it couldn't interrupt the work stream you already have)?

And presumably, other than us throwing it away later - the other approach wouldn't stop us doing it later.

@kazcw
Contributor

kazcw commented Oct 16, 2024

@jdunkerley

This feels like a much larger piece of work than the original suggestion. Ideally this would be delivered in this or the next sprint.

@kazcw how long would it take to implement this kind of approach (bearing in mind it couldn't interrupt the work stream you already have)?

And presumably, other than us throwing it away later - the other approach wouldn't stop us doing it later.

3 days, at most.

Status: 📤 Backlog
Development

No branches or pull requests

5 participants