Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce System.IString interface and implement it on System.String #19711

Closed
pharring opened this issue Dec 16, 2016 · 20 comments
Closed

Introduce System.IString interface and implement it on System.String #19711

pharring opened this issue Dec 16, 2016 · 20 comments
Labels
api-needs-work API needs work before it is approved, it is NOT ready for implementation area-System.Runtime design-discussion Ongoing discussion about design without consensus
Milestone

Comments

@pharring
Copy link
Contributor

Proposal

Introduce an interface called IString in the System namespace. The purpose of the interface is to identify string-like objects and provide a minimal set of methods over them. The native System.String type should implement this new interface.

Motivation

The motivation for this proposal is to allow library authors to implement efficient alternative representations of 'string-like' objects. An IString, therefore, represents the essence of a string: An immutable, finite sequence of Unicode characters that supports enumeration and random access.

Rationale

In code which does heavy text processing, strings are invariably the most frequently allocated type. Many strings are short-lived, but few are "novel". By "novel", I mean that most strings are realized by transforming existing data - they don't spring into existence out of thin air. Some common examples:

  1. A substring of an existing string.
  2. A concatenation of two or more existing strings or characters.
  3. The result of de-serializing a stream via an encoding.

In each of these cases (and many others), since public APIs tend to be built around the System.String primitive, early realization of the transformations is unavoidable. Both callers and implementors are forced to use the primitive representation, even when it would be inefficient. e.g. Trimming off the last character of a string by creating a substring with length = Length - 1.

By using IString instead of string in both the public APIs and internal implementations, library authors have an opportunity to reduce these hard-to-avoid allocations. Indeed, if we had something like this from the beginning in Roslyn, I feel we would have used it extensively.

Details

The proposed definition of IString is:

namespace System
{
    using System.Collections.Generic;

    interface IString : IEnumerable<char>
    {
        char this[int index] { get; }
        int Length { get; }
    }
}

Note that this definition is very similar to System.Collections.Generic.IReadOnlyList<char> except:

  1. The length property is named Length instead of Count, and
  2. It's easier to type (fewer chars) and discover (lives in the System namespace)

Examples

By way of an example for an alternative representation of a string-like thing using System.IString, here is a canonical "SubString" implementation:

using System;
using System.Collections;
using System.Collections.Generic;

sealed class SubString : IString
{
    private readonly IString _original;
    private readonly int _start;
    private readonly int _length;

    public SubString(IString original, int start, int length)
    {
        // TODO: Decide whether to throw on invalid start/length or just clamp the values.
        _original = original;
        _start = start;
        _length = length;
    }

    public char this[int index]
    {
        get
        {
            if (index < 0 || index >= _length)
            {
                throw new IndexOutOfRangeException(nameof(index));
            }

            return _original[_start + index];
        }
    }

    public int Length => _length;

    public IEnumerator<char> GetEnumerator()
    {
        // TODO: If _start is zero, then use the underlying enumerator.
        for (int i = 0; i < _length; i++)
        {
            yield return _original[i + _start];
        }
    }

    IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();
}

Here is a simpler example of an IString representing a single char:

using System;
using System.Collections;
using System.Collections.Generic;

sealed class CharString : IString
{
    private readonly char _ch;

    public CharString(char ch)
    {
        _ch = ch;
    }

    public char this[int index]
    {
        get
        {
            if (index != 0) throw new IndexOutOfRangeException();
            return _ch;
        }
    }

    public int Length => 1;

    public IEnumerator<char> GetEnumerator()
    {
        yield return _ch;
    }

    IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();
}

And, finally, here's a wrapper over a char array:

using System;
using System.Collections;
using System.Collections.Generic;

sealed class CharArrayWrapper : IString
{
    private readonly char[] _array;

    public CharArrayWrapper(char[] array)
    {
        _array = array;
    }

    public char this[int index] => _array[index];

    public int Length => _array.Length;

    public IEnumerator<char> GetEnumerator() => (IEnumerator<char>)_array.GetEnumerator();

    IEnumerator IEnumerable.GetEnumerator() => _array.GetEnumerator();
}

Comparison with IReadOnlyList<char>

There is an existing proposal, #19609, to add IReadOnlyList<char> to the set of interfaces implemented by System.String.

While that would be a sufficient alternative, the big downside is that it's harder to type System.Collections.Generic.IReadOnlyList<char>. For code that consumes such strings, one could mitigate that with a using alias, but it doesn't help the public API surface area. Consumers probably already have a using statement for the System.Collections.Generic namespace, but it's still hard to discover, and not immediately obvious to the uninitiated, that IReadOnlyList<char> is the 'I behave like a string' marker. I believe it's only a coincidence that IReadOnlyList<char> has the exact shape required to be that marker. We'll Ignore, for now, the different ways of spelling 'Length'.

IReadOnlyList<char> has an advantage, though, because char[] already implements it. i.e. a char[] would also be a valid 'string-like' implementation without modification. That would not be the case with System.IString, hence the need for the CharArrayWrapper shown above.

If we're willing to entertain both proposals, then we could redefine System.IString as:

namespace System
{
    using System.Collections.Generic;

    interface IString : IReadOnlyList<char>
    {
    }
}

and accept the (mis-)spelling of 'Length' ;-)

Limitations

It's hard to come up with a concatenation implementation that performs well in all cases: Do you chain to the left, to the right, or try to keep the 'tree' balanced? However, for specific cases where you're building "compound" strings by concatenation, sometimes a special-purpose object will do. e.g. It's possible to create a custom IString implementation to represent a fully-qualified type name from its parts while avoiding allocating the final string.

In some cases, it is better to do eager string realization. For example, if you're going to do a lot of random access operations on an IString that is actually a complex, compound implementation (e.g. a concatenation of a bunch of substrings of a UTF8-encoded text buffer). Advanced library implementations would detect when to switch to a more efficient implementation (e.g. rebalance a tree, or decode the entire UTF8 buffer). For other cases, clients can always use the .ToString() escape hatch. Certainly, we assume that library authors will strive to make most situations no worse than using System.String directly.

Future

If System.IString is adopted, there are plenty of places in the Framework where we could/should provide new overloads for methods that currently take System.String. e.g. Serialization, formatting, encoding, path building/parsing).

Of course, none of the performance benefits will be realized if a consumer of an IString-returning API immediately turns around and calls .ToString() on the result (perhaps in order to call another API which accepts only System.String). Without broad library support, this proposal is likely to falter. We have to start somewhere, though, and, if we're willing to entertain a new interface on string, then I humbly suggest that it has a name that is easy to discover and type.

@pharring
Copy link
Contributor Author

Looking for feedback from @KrzysztofCwalina @joperezr @stas-sultanov @sharwell @JonHanna

@danmoseley
Copy link
Member

@stephentoub

@joperezr
Copy link
Member

We recently closed a similar issue #18418. cc: @AlexGhiondea

@mellinoe
Copy link
Contributor

I think I need more convincing that this is actually a distinct concept from IReadOnlyList<char>, as proposed. The linked issue (#11391) actually describes a (in my mind) real IString interface (exposing the operations that string does), but this just looks like IReadOnlyList<char> with one member renamed.

Basically, I'm not really following this logic here:

[it's] not immediately obvious to the uninitiated, that IReadOnlyList is the 'I behave like a string' marker. I believe it's only a coincidence that IReadOnlyList has the exact shape required to be that marker.

I don't think "having a count and being read-only indexable" are what it means to behave like a string. To me, that is behaving as an IReadOnlyList. A type that behaved like a string would be able to do lexical operations like splitting, case-changing, concatenating, trimming, etc.

@pharring
Copy link
Contributor Author

@joperezr Thanks for the pointer to #18418. Apart from the name, this is different. This proposal is much closer to the IReadOnlyList<char> proposal, #19609. It is definitely not trying to provide a full abstraction over string operations which, I believe, was @KrzysztofCwalina's objection to #18418. I agree with that, but largely because System.String suffers from the same fault as C++'s std::string : too many methods (member functions). The resulting interface would be too 'fat'. All of System.String's methods that return a new string (except ToString() itself, obviously) would/could be better implemented as static helper methods (or extension methods on IString). If we didn't have all those factory methods on System.String, then it would reduce to something close to this proposal. Unfortunately, it's too late to change.

@pharring
Copy link
Contributor Author

@mellinoe I tried to explain above why this does not aim to provide abstractions over those lexical operations. As for not following the logic: I see, yes, 'I behave like a string' misses the mark. It's more like "I am an ordered sequence of chars with a definite length and I support random access". I believe that definition is minimal enough to be simple to implement but complete enough to be very useful. Random access could be dropped in many cases, but it's also too useful to ignore. Most string operations (split, case-changing, concat, trim, etc.) can be implemented efficiently over this interface (without adding them as separate methods themselves) and most consumers either store the value in a field or pass it on unmodified to another API.

@mellinoe
Copy link
Contributor

I see, yes, 'I behave like a string' misses the mark. It's more like "I am an ordered sequence of chars with a definite length and I support random access". I believe that definition is minimal enough to be simple to implement but complete enough to be very useful.

Yeah, I think we're in agreement there. To me, though, that sounds exactly like the definition of IReadOnlyList<char>, so I'm just trying to understand whether there's an important distinction in calling it IString when that name sounds more like what is described in #18418.

@pharring
Copy link
Contributor Author

@mellinoe Thanks for the feedback and glad we're in agreement. I tried to describe why I think System.IString is a better name than System.Collections.Generic.IReadOnlyList<char> in the proposal: In the System namespace and fewer characters. I originally tried to squeeze this naming proposal into #19609 but a comment there led me to make this separate proposal. I see what you mean, though. People might think that an IString implies more than this minimal functionality. Naming things is hard.

An alternative I came up with is System.IImmutableCharSequence, but it's a bit of a mouthful.
System.IString is such an attractive name.

@AlexGhiondea
Copy link
Contributor

@pharring thanks for writing this proposal!

I think the shape of this interface is so close to an existing one that I find it hard to justify adding it. Assuming string implemented IReadOnlyList<char> would your use cases be captured?

Naming things is really hard indeed! 😄 If we use IString here we would have to find a different name for the interface that describes string-like objects.

/cc @terrajobst

@pharring
Copy link
Contributor Author

@AlexGhiondea You're right. It is very close to IReadOnlyList<char>, and if that's where we end up with #19609, it won't be so bad. However, I feel that IReadOnlyList<char> is an unattractive name for reasons I've tried to explain above. To drive adoption, you want an attractive name that can appear naturally in method signatures and field definitions. I think my feelings would be quelled if there were a way to define an alias at the Framework level. Can that be done, for example, with type forwarders? If so, it would have the benefit that char[] gets included automatically. This would help (eliminate) some pieces of corefx that have had to deal with the duality:
https:/dotnet/corefx/blob/master/src/Common/src/System/CharArrayHelpers.cs
https:/dotnet/corefx/blob/master/src/Common/src/System/Text/StringOrCharArray.cs

I've also tried to explain why I don't think we want an interface that contains lexical and generative string operations. I believe is a mistake (an anti-pattern) that System.String contains non-static methods like Trim, Split and ToLower. How do you know where to stop? We were, I assume, copying Java - binging on OOP at the height of its popularity - and we know better now. Of course, that can't ever be remedied. So, I don't think we'd want to use IString for something like that.

@JonHanna
Copy link
Contributor

An immutable, finite sequence of Unicode characters that supports enumeration and random access.

Is immutable here "I'm not going to let you mutate this" or "I promise nothing will mutate this? Does that distinction need to be made/understood?

If I'm working with a SubString from the above I don't know if its mutable or not in that it could be backed by a truly (barring some potshots-at-ones-feet reflection) immutable string or a mutable char[]. I don't know if I can safely consolidate aliases to reduce space, for example. Something could implement IString and not be fixed-length, so a SubString could move in and out of being in a valid state (unless the rule was that out-of-range means an empty string).

At the very least I think we're likely to require an IsImmutable property. (char[] does have an IsReadOnly implementation already, though perhaps the fact that it returns false for zero-length [arguably a bug but also arguably too breaking to change] is at least sub-optimalt).

Could we make concatenation work as conveniently? I like being able to a + b rather than a.Concat(b), but don't think even hacking a + override into the CIL will work (IIRC it's valid CIL but the C# compiler doesn't pick it up in overload matching).

On a similar note while explicitly-qualified comparisons (case-sensitive/insensitive, given cultures, etc.) are obvious enough in their analogy to those on string being able to == and to a lesser extent <, >=, etc. for an ordinal comparison would be nice to retain.

I feel that IReadOnlyList is an unattractive name

I think that's a valid objection, but not a strong one. If heavy use of IReadOnlyList<char> was made then eventually it would become first known as "a trick for performance and flexibility", then "the common framework pattern" and eventually idiomatic .NET coding. Ironically it does not offer the ability to know an object is truly read-only I mention above.

@jamesqo
Copy link
Contributor

jamesqo commented Dec 18, 2016

Is immutable here "I'm not going to let you mutate this" or "I promise nothing will mutate this? Does that distinction need to be made/understood?

Very good point. I could imagine a lying IString implementation that returned Robert when first enumerated and Robert'; drop table students when enumerated again, leading to security vulnerabilities. Concrete types are the only way to guarantee immutability.

@yufeih
Copy link
Contributor

yufeih commented Dec 19, 2016

If we have dotnet/roslyn#7451, then IString could be just a global alias of IReadOnlyList

@pharring
Copy link
Contributor Author

@JonHanna I would prefer it to be "I promise that nothing mutates this". I don't think it's an undue burden on implementors, but it's worrying that an unadorned char[] would fail to fit the bill. char[]'s IsReadOnly property comes from IList and it's always false. Side note: One of the ideas we came up with when working on Immutable Collections was the ability to 'freeze' an array. It requires deep support from the CLR to enforce. I don't think that ever got turned into a proposal.

@jamesqo That example would, IMO, indicate a broken implementation. Unfortunately, I don't know how to enforce it. To be clear, though, we want "observable immutability", though. i.e. implementations should be allowed to be lazy and to mutate their internal representations for efficiency. As long as the sequence stays the same...

On the naming thing:

I think that's a valid objection, but not a strong one. If heavy use of IReadOnlyList was made then eventually it would become first known as "a trick for performance and flexibility", then "the common framework pattern" and eventually idiomatic .NET coding.

Totally agree. That's why I wanted to have this discussion before we committed #19609.

@jamesqo
Copy link
Contributor

jamesqo commented Dec 19, 2016

That example would, IMO, indicate a broken implementation. Unfortunately, I don't know how to enforce it.

It doesn't matter if it's broken-- the fact that someone can even write such a thing opens the doorway to lots of malicious inputs for methods accepting IString.

Plus, there's another side to your proposal you didn't mention-- having an interface will severely slow down indexing into the string. Right now, the string indexer is basically just a pointer add/dereference. For IString, the indexer will be an additional virtual method call per char.

@pharring
Copy link
Contributor Author

I agree, but I don't think that problem is limited to IString. IReadOnlyList<T> would be no better. I can imagine a malicious implementation of System.Collections.Immutable.IImmutableList<T>, for example. I don't know how to represent "idempotency" in the current type system - or even if that's necessary and sufficient. Maybe type theorists can tell us.

As for indexing perf: Adding the interface doesn't change anything for existing consumers of System.String, AFAIK. Certainly, the JIT compiler has special knowledge of string and optimizes accordingly. Yes, those optimizations go away when you access through an interface; not only due to the indirect call, but also the additional range checking that the JIT compiler could otherwise hoist or eliminate. For some string representations (encodings, in particular), indexing is especially costly; enumeration is preferred. Consumers should be encouraged to do enumeration with foreach instead of for.

@Nirmal4G
Copy link

Nirmal4G commented Jun 4, 2018

With UTF8String in corefxlab etc... This would be a nice addition to .NET!

@msftgits msftgits transferred this issue from dotnet/corefx Jan 31, 2020
@msftgits msftgits added this to the 5.0 milestone Jan 31, 2020
@maryamariyan maryamariyan added the untriaged New issue has not been triaged by the area owner label Feb 23, 2020
@joperezr joperezr removed the untriaged New issue has not been triaged by the area owner label Jul 6, 2020
@joperezr joperezr modified the milestones: 5.0.0, Future Jul 6, 2020
@stephentoub
Copy link
Member

With the advent of ReadOnlySpan<char>, it has become the canonical way to write operations that can process strings, substrings, arrays of chars, etc., effectively making it an exchange type similar to what's been requested here. I don't see us creating another. I'm going to close this. Thanks.

@pharring
Copy link
Contributor Author

Reading back through the thread, I'm impressed by my younger self. That fellow made a convincing case! However, I'm glad that more enlightened people ignored him and found an even better solution. Yes, Span<T> and ReadOnlySpan<T> really are superior, covering more scenarios.

@stephentoub
Copy link
Member

Heh. Thanks, Paul 😄

@ghost ghost locked as resolved and limited conversation to collaborators Apr 10, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
api-needs-work API needs work before it is approved, it is NOT ready for implementation area-System.Runtime design-discussion Ongoing discussion about design without consensus
Projects
None yet
Development

No branches or pull requests