Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add 'Span::len()' and 'Span::trimmed()' proc-macro methods. #54373

Closed
wants to merge 1 commit into from

Conversation

SergioBenitez
Copy link
Contributor

Span::len() is fairly self explanatory. The latter, trimmed(), is necessary to get sub-spans of an existing Span. Rocket uses this, for instance, to point into slices of a string.

@rust-highfive
Copy link
Collaborator

r? @aturon

(rust_highfive has picked a reviewer for you, use r? to override)

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Sep 20, 2018
@SergioBenitez
Copy link
Contributor Author

r? @alexcrichton

@rust-highfive rust-highfive assigned alexcrichton and unassigned aturon Sep 20, 2018
@petrochenkov
Copy link
Contributor

Previous attempt: #53930

@alexcrichton
Copy link
Member

Thanks for the PR! My original comment on #53930 applies here as well though I think, do you have thoughts on that?

cc @dtolnay, you're likely interested in this as well

@SergioBenitez
Copy link
Contributor Author

SergioBenitez commented Sep 21, 2018

Copying over your comment from #53930:

Seems like a plausible API! I think this could definitely work for things like spans of string literals, but how does this work for spans of Group tokens? There's no way to know where subspans are within a larger span, and the integers here are byte positions in the source and the token stream doesn't know much about whitespace?

If I understood this correctly, I think you're saying that trimmed wouldn't be particularly useful for Group tokens since the proc-macro wouldn't know how the tokens map to source code, is that right? If so, I can totally see that, but I wonder if that's necessarily a problem here. I don't know why someone would want to trim such a Span, though perhaps that's your point: we're exposing an interface that's too expressive and thus easy to misuse.

An alternative is to expose len and sub_span methods from Literal and Ident instead. The downside is that libraries like syn must then re-expose these methods from their own AST types.

On another API note, does this handle pointing a byte position to the middle of a multibyte unicode character? Or would that perhaps cause an ICE later down the line?

We should definitely handle ensuring that the sub_span starts and ends at a unicode character boundary when the Span is created. If we make sub_span a method of Literal and Ident such a thing would be easy to handle.

@dtolnay
Copy link
Member

dtolnay commented Sep 21, 2018

The subspan method from #53930 is more intuitive to me than trimmed. If I am println and I want to error on part of a string literal,

"a={a} b={b} c={c}"
         ^^^

then I would rather express this as subspan(9..12) than trimmed(9, 7).

@alexcrichton
Copy link
Member

Another aspect I'm worried about as well is that it's not clear that we could actually support this sort of span selection via indices that well. For example with strings if you wanted to have character at the letter 'a' how do you handle strings like r#"a"#, r####"a"####, "\u{61}", etc? I'm not sure we have a way to get the actual source string, just the value after parsing? In that sense how do we know where within the actual source code representation to place the carets?

@alexcrichton
Copy link
Member

(I do agree with @dtolnay as well that the API to behave slice-like makes more sense to me too)

@SergioBenitez
Copy link
Contributor Author

SergioBenitez commented Sep 22, 2018

(I do agree with @dtolnay as well that the API to behave slice-like makes more sense to me too)

I do too, actually. Happy to change that.

I'm not sure we have a way to get the actual source string, just the value after parsing?

I was wondering this, actually; what is .to_string() on a Literal supposed to return? It's not well specified. I assumed it was the source string of the literal, but that doesn't appear to be the case.

I see ~three paths forward here:

  1. Expose the source string in some way; expose sub_span(), let the user figure out how the parsed value maps back to the source string.

  2. Expose more information in Literal about the source string and the literal itself. In particular, expose a non-exhaustive (for forwards-compatibility) enum that specifies the kind of literal with (almost) enough information to recover the source string. Such an enum might look like:

    enum LiteralKind {
        Byte,
        Char,
        Integer(Prefix, Suffix),
        Float(Prefix, Suffix),
        CookedStr,
        RawStr(usize),
        ByteStr,
        RawByteStr(usize),
        #[doc(hidden)]
        __NonExhausitive
    }
    
    struct Prefix(&'static str);
    
    impl Prefix { fn len(&self) -> usize }
    
    struct Suffix(&'static str);
    
    impl Suffix { fn len(&self) -> usize }

    Of course, when there's any kind of escaping going on, this isn't sufficient to recover the source string.

  3. Have sub_span() "do the right thing" automatically. In particular, it can compute a mapping from an indexed passed in, corresponding to an index in the string returned from to_string(), to an index in the source string.

    So, for something like:

    * Source String: `r##"hello"##`
    * `.to_string()` -> hello
    * foo.sub_span(2..4) -> `r##"hello"##`
                                   ^^
    

    Or, similarly:

    * Source String: `"hel\u{0x6c}o"`
    * `.to_string()` -> hello
    * foo.sub_span(2..4) -> `"hel\u{0x6c}o"`
                                ^^^^^^^^^
    

    This has fairly natural mappings for most things. Floats might be a bit weird, but I think we can find a natural mapping as well. The downside with this approach is the complexity of implementation.

I'm inclined to suggest 3) as the path forward, though I do want to note that 1) allows the user to implement 3) themselves, and is much easier to implement.

I'm interested in learning if you have any other ideas @dtolnay, @alexcrichton. It's really important for Rocket's diagnostics that we're able to create subspans. In Rocket's case, we only need subspans of cooked strings, so I'd be happy with a targeted solution there, though I'm sure others will need a more general solution that works for other literals or idents.

@dtolnay
Copy link
Member

dtolnay commented Sep 22, 2018

I like 3 a lot if the implementation works out.

What is the right behavior if a macro scrambles spans across different tokens? Suppose we have:

#[a]
#[b]
const S: &str = "hel\u{0x6c}o";

If a (or b, whichever expands first) expands to the same tokens but with the string literal having the span that was originally the span of the equal sign, then the one that expands second will observe string_literal.to_string().len() != string_literal.span().len(). Even in the case that lengths happen to match, the span you're slicing may be the span of some totally different original token (potentially with multi-byte character boundaries at different offsets, which is extra unfortunate).

We should definitely handle ensuring that the sub_span starts and ends at a unicode character boundary when the Span is created.

If possible, I would prefer that subspan allow slicing anywhere that is in bounds. Afterward whatever code needs a span aligned to character boundaries can round outward to the nearest character boundary. This way the naive cases like sp.subspan(0..1) will just do the right thing in the presence of multi-byte chars.

@SergioBenitez
Copy link
Contributor Author

What is the right behavior if a macro scrambles spans across different tokens? Suppose we have:

I think we'd need to either 1) keep track of the original span and always return the Span corresponding to the literal, regardless of whether it's been changed, or 2) record that the Span has changed and return a None in that case. Option 1) sounds similar to what you implemented in #53902, if I'm not mistaken?

If possible, I would prefer that subspan allow slicing anywhere that is in bounds. Afterward whatever code needs a span aligned to character boundaries can round outward to the nearest character boundary.

Why do you prefer this behavior? I'm not necessarily opposed to it, except that it's not consistent with how how slicing on strings works today. Is there a benefit to this relaxed approach?

@alexcrichton
Copy link
Member

I also really like option 3 if we can get it to work, but I wonder if we could perhaps strike a more minimal compromise for now? What if we only added Literal::subspan instead of Span::subspan? (or some method to only get a sub-span of a Literal). That may help solve the mismatched span issue more obviously (an obvious candidate to return None as it no longer has the original span?).

I think it'd also provide a clear location to document the internal parsing that rustc would be doing as we could clearly say "these indices are indexes into a string literal's actual literal value" and... do "something reasonable" if the literal isn't a string (maybe return None?)

@SergioBenitez
Copy link
Contributor Author

What if we only added Literal::subspan instead of Span::subspan?

Absolutely. That's what I was proposing. :)

Sounds like we all agree that option 3 is the way to go. I'll give its implementation some thought and see what I come up with.

@alexcrichton
Copy link
Member

Er sorry, then yeah that sounds great to me!

@TimNN TimNN added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Oct 16, 2018
@TimNN
Copy link
Contributor

TimNN commented Oct 16, 2018

Ping from triage @SergioBenitez: What is the status of this PR?

@SergioBenitez
Copy link
Contributor Author

SergioBenitez commented Oct 17, 2018

@TimNN I have not had a chance to implement the revised strategy.

@TimNN
Copy link
Contributor

TimNN commented Oct 30, 2018

Ping from triage @SergioBenitez: It looks like this PR hasn't been updated in a while, so I'm closing it for now, per our guidelines. Thanks for your contributions and please feel free to re-open in the future.

@TimNN TimNN closed this Oct 30, 2018
@TimNN TimNN added S-inactive Status: Inactive and waiting on the author. This is often applied to closed PRs. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels Oct 30, 2018
@dtolnay dtolnay self-assigned this Mar 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
S-inactive Status: Inactive and waiting on the author. This is often applied to closed PRs.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants