Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[API Proposal]: string/ReadOnlySpan<char> Truncate #85603

Open
hrrrrustic opened this issue May 1, 2023 · 8 comments
Open

[API Proposal]: string/ReadOnlySpan<char> Truncate #85603

hrrrrustic opened this issue May 1, 2023 · 8 comments
Labels
api-suggestion Early API idea and discussion, it is NOT ready for implementation area-System.Runtime needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration
Milestone

Comments

@hrrrrustic
Copy link
Contributor

hrrrrustic commented May 1, 2023

Background and motivation

For int overload there is a #31655 with good use-cases and motivation.

Copy-paste from #31655I propose a new method on String to perform a fairly common task: Truncating a string to a maximum length. Scenarios:

  1. Inserting to a limited-length database field when data loss is preferable over failing. For example, when logging a user agent string of an HTTP request.
  2. Displaying a potentially long string to the user.
  3. Making sure that untrusted input is not unexpectedly long when failure is not desirable.

In my experience, this is quite a common need. I often need to define a helper function for this.
Semantics:

  1. A string that is less than the maximum length is unaltered.
  2. The API never returns a string larger than the maximum length.

Two more from myself:

  1. Scripting/Debugging - to reduce console output when you actually don't need a full string, just a beginning to make sure it looks fine
  2. This is just a shortcut for string.Substring(0, maxLength) or span.Slice(0, maxLength) without manually passing 0 😄

Key difference from #31655 that there is no truncationSuffix. It is too specific for core library and probably should be placed in UI frameworks (also discussed in original issue). Kotlin has similar method Take, Python just doesn't throw on substringing

I also added ROS<char> version to avoid allocating. Also keep in mind that ROS<> even doesn't have Enumerable.Take workaround and should be manually if-checked for length

char overload is basically

var index = str.IndexOf('');
if (index != -1)
    str = str.Substring(0, index);

that is also pretty repeated pattern

API Proposal

namespace System
{
    public class String
    {
        public string Truncate(int maxLength);
        public string Truncate(char until);
    }

   // Same on Span<T> ?
    public readonly ref struct ReadOnlySpan<T>
    {
        public ReadOnlySpan<T> Truncate<T>(int maxLength);
        public ReadOnlySpan<T> Truncate<T>(T until) 
            where T : IEquatable<T>; // For IndexOf
    }

    public static class MemoryExtensions
    {
        // Not sure does it make sense for some vectorization purposes on strings. Feel free to drop this one
        public static ReadOnlySpan<char> Truncate(this ReadOnlySpan<char> span, char until);
    }
}

API Usage

Main use-cases described above, I just link some examples from /runtime. Most of them are for ROS<> since BCL tries to avoid using .Substring, but obviously it's not possible to use ROS<> everywhere instead string
For int overload Math.Min pattern is pretty common:
(More cases in other Asn files)

int len = Math.Min(longestPermitted, source.Length);
return source.Slice(0, len);

(More cases in this and other Tar files)
int numToCopy = Math.Min(bytesToWrite.Length, destination.Length);
bytesToWrite = bytesToWrite.Slice(0, numToCopy);

private static void SliceLongerSpanToMatchShorterLength<T>(ref ReadOnlySpan<T> span, ref ReadOnlySpan<T> other)
{
if (other.Length > span.Length)
{
other = other.Slice(0, span.Length);
}
else if (span.Length > other.Length)
{
span = span.Slice(0, other.Length);

char overload:

int newlinePos = buffer.IndexOf((byte)'\n');
if (newlinePos < 0)
{
return false;
}
ReadOnlySpan<byte> line = buffer.Slice(0, newlinePos);

int plusIndex = versionString.IndexOf('+');
if (plusIndex >= 0)
{
versionString = versionString.Slice(0, plusIndex);

Alternative Designs

  • ROS<> version probably could be generic instead char (and use different index search), but I don't have any examples for non-char usages
  • Also add this to Span<T>? But I don't think there are too much mutable char spans

Risks

There are so much IndexOf variations like LastIndexOf, AnyIndexOf that should be supported or completely ignored

@hrrrrustic hrrrrustic added the api-suggestion Early API idea and discussion, it is NOT ready for implementation label May 1, 2023
@dotnet-issue-labeler dotnet-issue-labeler bot added the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label May 1, 2023
@ghost ghost added the untriaged New issue has not been triaged by the area owner label May 1, 2023
@DaZombieKiller
Copy link
Contributor

The span version of Truncate would be more useful if it worked for any equatable T:

public static class MemoryExtensions
{
-   public ReadOnlySpan<char> Truncate(this ReadOnlySpan<char> span, char until);
+   public ReadOnlySpan<T> Truncate<T>(this ReadOnlySpan<T> span, T until)
+       where T : IEquatable<T>;
}

I often have a similar method in projects, it's useful when dealing with buffers containing null-terminated strings.

@ghost
Copy link

ghost commented May 2, 2023

Tagging subscribers to this area: @dotnet/area-system-runtime
See info in area-owners.md if you want to be subscribed.

Issue Details

Background and motivation

For int overload there is a #31655 with good use-cases and motivation.

Copy-paste from #31655I propose a new method on String to perform a fairly common task: Truncating a string to a maximum length. Scenarios:

  1. Inserting to a limited-length database field when data loss is preferable over failing. For example, when logging a user agent string of an HTTP request.
  2. Displaying a potentially long string to the user.
  3. Making sure that untrusted input is not unexpectedly long when failure is not desirable.

In my experience, this is quite a common need. I often need to define a helper function for this.
Semantics:

  1. A string that is less than the maximum length is unaltered.
  2. The API never returns a string larger than the maximum length.

Two more from myself:

  1. Scripting/Debugging - to reduce console output when you actually don't need a full string, just a beginning to make sure it looks fine
  2. This is just a shortcut for string.Substring(0, maxLength) or span.Slice(0, maxLength) without manually passing 0 😄

Key difference from #31655 that there is no truncationSuffix. It is too specific for core library and probably should be placed in UI frameworks (also discussed in original issue). Kotlin has similar method Take, Python just doesn't throw on substringing

I also added ROS<char> version to avoid allocating. Also keep in mind that ROS<> even doesn't have Enumerable.Take workaround and should be manually if-checked for length

char overload is basically

var index = str.IndexOf('');
if (index != -1)
    str = str.Substring(0, index);

that is also pretty repeated pattern

API Proposal

namespace System
{
    public class String
    {
        public string Truncate(int maxLength);
        public string Truncate(char until);
    }

    public readonly ref struct ReadOnlySpan<T>
    {
        public Span<T> Truncate(int maxLength);
    }

    public static class MemoryExtensions
    {
        public ReadOnlySpan<char> Truncate(this ReadOnlySpan<char> span, char until);
    }
}

API Usage

Main use-cases described above, I just link some examples from /runtime. Most of them are for ROS<> since BCL tries to avoid using .Substring, but obviously it's not possible to use ROS<> everywhere instead string
For int overload Math.Min pattern is pretty common:
(More cases in other Asn files)

int len = Math.Min(longestPermitted, source.Length);
return source.Slice(0, len);

(More cases in this and other Tar files)
int numToCopy = Math.Min(bytesToWrite.Length, destination.Length);
bytesToWrite = bytesToWrite.Slice(0, numToCopy);

private static void SliceLongerSpanToMatchShorterLength<T>(ref ReadOnlySpan<T> span, ref ReadOnlySpan<T> other)
{
if (other.Length > span.Length)
{
other = other.Slice(0, span.Length);
}
else if (span.Length > other.Length)
{
span = span.Slice(0, other.Length);

char overload:

int newlinePos = buffer.IndexOf((byte)'\n');
if (newlinePos < 0)
{
return false;
}
ReadOnlySpan<byte> line = buffer.Slice(0, newlinePos);

int plusIndex = versionString.IndexOf('+');
if (plusIndex >= 0)
{
versionString = versionString.Slice(0, plusIndex);

Alternative Designs

  • ROS<> version probably could be generic instead char (and use different index search), but I don't have any examples for non-char usages
  • Also add this to Span<T>? But I don't think there are too much mutable char spans

Risks

There are so much IndexOf variations like LastIndexOf, AnyIndexOf that should be supported or completely ignored

Author: hrrrrustic
Assignees: -
Labels:

api-suggestion, area-System.Runtime, untriaged, needs-area-label

Milestone: -

@MichalPetryka
Copy link
Contributor

Maybe add an overload that takes maxLength and a value, so that it truncates to the last occurance of the value when the length is above the limit? I've used such helper in my code to truncate text for display in UI.

@vcsjones vcsjones removed the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label May 3, 2023
@hrrrrustic
Copy link
Contributor Author

Updated a proposal with generic overload
I hope someone will see this issue 😄

@tannergooding
Copy link
Member

I don't think this proposal is in the right shape. It differs quite a bit from other APIs we have on the span types.

In general, we want most methods to be extension methods in MemoryExtensions not instance methods on Span/ROSpan.

I'm not sure we want to take this proposal in as a general purpose one. Truncate can have different meanings depending on context. Consider for example what a user might expect it to mean for ReadOnlySpan<float>.

There is also a concern around taking both char and int since this can be very tricky due to implicit conversions.

@tannergooding tannergooding added needs-author-action An issue or pull request that requires more info or actions from the author. and removed untriaged New issue has not been triaged by the area owner labels May 26, 2023
@ghost
Copy link

ghost commented May 26, 2023

This issue has been marked needs-author-action and may be missing some important information.

@DaZombieKiller
Copy link
Contributor

There is also a concern around taking both char and int since this can be very tricky due to implicit conversions.

It's also ambiguous when T is int for the Span<T> versions.

@hrrrrustic
Copy link
Contributor Author

hrrrrustic commented May 26, 2023

It's also ambiguous when T is int for the Span versions.

Thanks! Totally forgot about that.
Well, let's just drop char overload to make this issue easier.

public static class MemoryExtensions
{
    // Same on Span<T>?
    public static ReadOnlySpan<T> Truncate<T>(this ReadOnlySpan<T> span, int maxLength);
}

public class String
{
    public string Truncate(int maxLength);
}

Truncate can have different meanings depending on context

What about simple Take? Just same as Enumerable.Take (and Kotlin version), but more efficient

@ghost ghost added needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration and removed needs-author-action An issue or pull request that requires more info or actions from the author. labels May 26, 2023
@stephentoub stephentoub added this to the Future milestone Jul 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-suggestion Early API idea and discussion, it is NOT ready for implementation area-System.Runtime needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration
Projects
None yet
Development

No branches or pull requests

7 participants