[Discussion] Revisiting pattern matching on collections #1039

DavidArno · 2017-10-24T14:31:49Z

DavidArno
Oct 24, 2017

My aim with this discussion is to explore how pattern matching could be supported over any IEnumerable<T>. Of course, this immediately raises a problem: how do we avoid re-enumerating the collection as we pattern match? Further, how do we avoid the need for an enumerator, for as @gafter says, patterns shouldn't generally be executing methods like GetEnumerator/MoveNext?

Taking the second concern first, typical applications of pattern matching over a collection involve some sort of matching against the head and then recursively re-applying the patterns to the tail. Leaving this as a recursive implementation leads too easily to a stack overflow. So tail-call optimisation is used to turn it into an iterative implementation. Such an implementation naturally lends itself to using an enumerator. So my gut feeling here is that trying to avoid GetEnumerator and MoveNext is not practical. Whilst it's highly likely that folk way more clever than me have found a way around this, eg for F#'s pattern matching over linked lists, the rest of this discussion works on the assumption that this isn't really a problem.

Regarding the problem of re-enumerating an IEnumerable<T>, the two solutions I have come up with for this are (1) to cache the enumeration as it is read, using a linked list, and (2) to copy it to an array (or other indexed collection) first.

The advantages I see to the linked list approach are:

The collection only needs to be enumerated as deeply as needed (eg to determine no pattern matches the start of the collection),
Each node effectively forms the head of a sub collection, allowing easy support of recursive matching,
The original enumeration, and its enumerator, can be stored in the last node. This allows the linked list to be "self contained".

However, the linked list approach has an obvious disadvantage over copying to an array, namely that elements cannot be accessed via an index and must be found by starting at the beginning of the list each time. This would likely add huge overheads to all but the most trivial of patterns.

For the rest of this discussion, I'll ignore which method is used as the correct way to determine the best solution would be through performance testing, rather than me guessing.
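
To make the "cache the enumeration as it is read" idea concrete, here is a minimal sketch. LazyCache<T> and TryGet are hypothetical names, not framework types, and the buffer is a List<T> rather than a linked list purely to keep the sketch short. It only illustrates that the source is enumerated once and no deeper than the pattern needs.

```csharp
using System;
using System.Collections.Generic;

// Demo: only as many source elements are pulled as the "pattern" inspects.
var pulled = 0;
IEnumerable<int> Source()
{
    for (var i = 1; i <= 1000; i++) { pulled++; yield return i; }
}

using var cache = new LazyCache<int>(Source());
bool startsWithOneTwo = cache.TryGet(0, out var a) && a == 1
                     && cache.TryGet(1, out var b) && b == 2;
Console.WriteLine($"{startsWithOneTwo}, pulled {pulled} of 1000"); // True, pulled 2 of 1000

// Hypothetical helper: buffers elements on demand so the source is
// enumerated at most once, and only as deeply as needed.
sealed class LazyCache<T> : IDisposable
{
    private readonly IEnumerator<T> _source;
    private readonly List<T> _buffer = new();
    private bool _exhausted;

    public LazyCache(IEnumerable<T> source) => _source = source.GetEnumerator();

    // Returns false if the sequence has fewer than index + 1 elements.
    public bool TryGet(int index, out T value)
    {
        while (_buffer.Count <= index && !_exhausted)
        {
            if (_source.MoveNext()) _buffer.Add(_source.Current);
            else _exhausted = true;
        }
        if (index < _buffer.Count) { value = _buffer[index]; return true; }
        value = default!;
        return false;
    }

    public void Dispose() => _source.Dispose();
}
```

Whether a linked list, an array or a List<T> is the right backing store is exactly the question that performance testing would need to answer.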

Basic pattern matching

Taking ideas from Roslyn #10631 (ie, @gafter's link above), the simplest form of pattern matching would match the whole collection. I've changed the syntax here as @alrz points out that the proposed property pattern syntax would likely clash with the ideas of using {} for those collection patterns. So I've used [] here (until someone points out how that will clash too!)

if (collection is [1, 2, 3]) 
    // true if collection contains exactly the three elements, [1,2,3]

if (collection is [])
    // empty collection matches

if (collection is [1, _, 3])
    // matches if collection contains exactly the three elements and
    // element 0 is 1 and element 2 is 3. Element 1 can be any value

if (collection is [1, var x, 3])
    // matches as above, but element 1's value is assigned to x

Element wildcard matching

Having to match the whole collection is restrictive though. Again borrowing the syntax from the above linked issue, we could use .. to denote zero or more elements:

if (collection is [.., 99, 100]) 
    // matches on the last two elements only

if (collection is [1, 2, ..]) 
    // matches on the first two elements

if (collection is [1, .., 100]) 
    // matches the first and last elements
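
For anyone reading this later: the [] forms above are very close to the list patterns that shipped in C# 11. The following is a small sketch, assuming C# 11 or later and an int[] rather than an arbitrary IEnumerable<T> (the shipped feature needs a countable, indexable type).

```csharp
using System;

int[] collection = { 1, 2, 3 };

bool exact      = collection is [1, 2, 3];   // exactly these three elements
bool empty      = collection is [];          // only an empty collection matches
bool discard    = collection is [1, _, 3];   // element 1 can be any value
int  captured   = collection is [1, var x, 3] ? x : -1; // element 1 captured as x
bool endsWith   = collection is [.., 2, 3];  // last two elements only
bool startsWith = collection is [1, 2, ..];  // first two elements only
bool firstLast  = collection is [1, .., 3];  // first and last elements

Console.WriteLine($"{exact} {empty} {discard} {captured} {endsWith} {startsWith} {firstLast}");
// True False True 2 True True True
```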

Recursive matching

As I said at the beginning, such matches are rather limited when dealing with collections. What's commonly required is a means of recursing over each element in turn. A possible example of this in action is shown below:

int Sum(IEnumerable<int> collection) =>
    collection switch (
        case [] : 0, // empty collection handler
        case [var last] : last,
        case var head :: tail : Sum(tail) + head
    );

Also as previously mentioned, whilst it's useful to express the pattern recursively, tail call optimisation is normally used in such situations to avoid stack overflows.

There is a problem with the above syntax: the use of :: as a cons operator. I proposed the idea of adding cons to C# a while ago. This was met with scepticism and, as @svick also pointed out, :: is already used for namespace aliases.

To avoid this, I considered the idea of the framework providing a deconstruct for IEnumerable<T>, that split it into the head and tail. This creates an interesting new problem though, which I'll label the "fragile substitution problem". Put simply, if I've my own type that implements IEnumerable<T> that already defines a void Deconstruct(out T x, out IEnumerable<T> y), then that method would override the default one of IEnumerable<T>, but only at runtime, potentially causing a runtime failure.

The solution that I think avoids both of these issues and avoids yet more new syntax is to use the previously discussed [..] notation, but to allow the assignment of the .. part to a variable. For example,

if (collection is [1, 2, .. var theRest])
    // for array [1, 2, 3, 4], theRest will be [3, 4] here

So to split the head and tail, we could just use [var head, .. var tail], or even var [head, .. tail]:

int Sum(IEnumerable<int> collection) =>
    collection switch (
        case [] : 0, // empty collection handler
        case [var last] : last,
        case var [head, ..tail] : Sum(tail) + head
    );
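
As a point of comparison, the head/tail split can be written with the list patterns that later shipped in C# 11. This is only a sketch: it assumes C# 11 or later and takes a ReadOnlySpan<int> instead of an IEnumerable<int>, because the shipped slice pattern needs an indexable, sliceable type. C# does not guarantee tail-call optimisation, so the recursion depth still grows with the length of the input.

```csharp
using System;

Console.WriteLine(Sum(new[] { 1, 2, 3, 4 })); // 10

// Recursive sum over a span: [] is the empty case, and
// [var head, .. var tail] splits off the first element.
static int Sum(ReadOnlySpan<int> values) => values switch
{
    [] => 0,
    [var head, .. var tail] => head + Sum(tail),
};
```

Slicing a span does not copy, so each recursive step is cheap. Doing the same over an int[] would allocate a new array for every tail.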

Recursing just within the `switch`

The previous recursive example relies on the switch being the entire body of the method and/or for it to make sense to repeat the whole method for each iteration. This may not always be desirable though. A proposed way around this is to re-use switch to indicate a recursion is needed:

int Sum(IEnumerable<int> collection) =>
    collection switch (
        case [] : 0, // empty collection handler
        case [var last] : last,
        case var [head, ..tail] : switch(tail) + head
    );

Returning collections from the pattern match

For a pattern that needs to return an enumeration, I played around with various ideas on how to build the collection with a linked list, returning it when complete. This didn't feel right for IEnumerable<T> though. The latter supports yield and this seems a nice, C#-idiomatic way of handling patterns too, without any new types or syntax being needed:

IEnumerable<T> ReturnSelf<T>(IEnumerable<T> collection)
{
    switch (collection)
    {
        case [] : yield break; // empty collection handler
        case [var last] : yield return last;
        case var [head, .. tail] : 
            yield return head;
            switch(tail);
    }
}

To show this collection pattern matching in action, consider the situation where we have a sequence of values, eg

[1, 1, 1, 2, 5, 1, 1, 5, 5, 5, 5, 3]

And we want to reduce it to a collection of values and run-lengths:

[(1, 3), (2, 1), (5, 1), (1, 2), (5, 4), (3, 1)]

A way of achieving this, using pattern matching, is shown below:

IEnumerable<(T, int)> GetSequenceCounts<T>(IEnumerable<T> collection)
{
    let [head, .. tail] = collection else return new (T, int)[0];
    return SequenceCounts(tail, (head, 1));

    IEnumerable<(T, int)> SequenceCounts(IEnumerable<T> collection, 
                                         (T item, int count) current)
    {
        switch ((collection, current))
        {
            case [] : yield return current;
            case var [head, ..tail] when head == current.item : 
                switch (tail, (current.item, current.count + 1));
            case var [head, ..tail] :
                yield return current;
                switch (tail, (head, 1));
        }
    }
}

Opiumtm · 2017-10-24T15:07:55Z

Opiumtm
Oct 24, 2017

I think it's OK to iterate over IEnumerable<T> repeatedly.

It may be awful for performance, but that is just as true of multiple foreach loops over a collection, or of LINQ queries over an enumeration.

The developer should be aware of multiple iterations and could cache the collection themselves before pattern matching.

And then there is a method to reset an enumerator to its initial state.

The developer could use multiple-iteration-friendly implementations of IEnumerable<T> when pattern matching. I'm strongly against implicit iterator caching. There are cases (most probably when the IEnumerable<T> is a collection returned by yield return without any additional overhead) when it's safe to iterate over the collection as many times as required. The developer would know better whether it's safe to iterate over the collection multiple times.

0 replies

orthoxerox · 2017-10-24T15:08:09Z

orthoxerox
Oct 24, 2017

The biggest question to me is: why would someone pattern match against a collection? The last example in the post would probably be clearer when written as a loop.

0 replies

iam3yal · 2017-10-24T15:13:54Z

iam3yal
Oct 24, 2017

It looks great on paper, but I really keep asking myself: what's the benefit of this over the traditional approach? And before this happens, don't we need to standardize a syntax for collections, ranges, etc.?

0 replies

svick · 2017-10-24T17:58:11Z

svick
Oct 24, 2017
Collaborator

My thoughts:

1. I think that pattern matching over collections is useful. I think it could make many collection-related coding patterns simpler.
2. I think that pattern matching should not enumerate a collection more than once. If possible, it should enumerate it less than once.

   Caching might be necessary to achieve that. In that case, the cache should be lazy and based on List<T>: the cache is lazy, i.e. filled as the collection is enumerated for the first time, not all at once; it's based on List<T> to make it efficient, since linked lists are inefficient.
3. In general, I'm not convinced that recursive matching makes sense for C#. Yes, pattern matching comes from functional programming, but that does not mean pattern matching in C# should try to emulate functional languages as much as possible.

   I think that recursively processing collections is not a good fit for C# (even if it had pattern matching). Instead, collection pattern matching should be made to work well with the imperative and iterative style that's idiomatic in C#.

   C# has a tradition of taking functional concepts and trying to make them easier to understand to "traditional" programmers*; I think a similar approach should be used here.
4. Specifically, I don't like the idea of a recursive switch() expression or the implicit yield foreach it seems to be doing. If you do want to use a pattern recursively, local functions already offer a decent syntax for that. And yield foreach is useful, but it should be explicit; I think the code would be hard to understand otherwise.

For the GetSequenceCounts example, I tried to think of how I could write it most clearly, including using some form of pattern matching. But I ended up with C# 7.0 code:

IEnumerable<(T, int)> GetSequenceCounts<T>(IEnumerable<T> collection)
{
    var current = (value: default(T), count: 0);
    
    foreach (var item in collection)
    {
        // == doesn't actually compile here, but the original code used it too
        if (item == current.value)
        {
            current.count++;
        }
        else
        {
            if (current.count > 0)
                yield return current;
                
            current = (item, 1);
        }
    }
    
    if (current.count > 0)
        yield return current;
}

It's slightly more lines of code than the example in the original post, but it's also significantly fewer characters, which should make it easier to understand. (Also note that it mutates a tuple 😁.)
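
Since == does not compile for an unconstrained T, here is a sketch of the same loop that does compile. The only substantive change is comparing through EqualityComparer<T>.Default; that choice of comparer, and the tuple element names, are assumptions on top of the code above.

```csharp
using System;
using System.Collections.Generic;

var input = new[] { 1, 1, 1, 2, 5, 1, 1, 5, 5, 5, 5, 3 };
Console.WriteLine(string.Join(", ", GetSequenceCounts(input)));
// (1, 3), (2, 1), (5, 1), (1, 2), (5, 4), (3, 1)

static IEnumerable<(T Value, int Count)> GetSequenceCounts<T>(IEnumerable<T> collection)
{
    var comparer = EqualityComparer<T>.Default;   // stands in for ==
    (T Value, int Count) current = (default!, 0);

    foreach (var item in collection)
    {
        if (comparer.Equals(item, current.Value))
        {
            current.Count++;                      // extend the current run
        }
        else
        {
            if (current.Count > 0)
                yield return current;             // emit the finished run

            current = (item, 1);                  // start a new run
        }
    }

    if (current.Count > 0)
        yield return current;                     // emit the final run, if any
}
```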

* Some examples, which I think were mostly successful:

It's not a curried first-class function (e.g. int -> int -> int), it's a simple generic delegate (Func<int, int, int>).
It's not a higher-order function (e.g. map), it's an SQL-like clause (select).
It's not a general computation expression inspired by monadic do notation, it's async-await.

0 replies

alrz · 2017-10-24T18:26:32Z

alrz
Oct 24, 2017

I don't think matching against IEnumerable<T> makes much sense (exactly because of the need for caching etc). Matching logic should be dead simple, could be limited to arrays and IList<T>.

I'm not really a fan of "cons patterns" in C# mostly because of (3) in @svick's comment. F# has it because list literals in F# create an immutable linked list, so a cons pattern is expected there. You can always write an extension pattern for any other data structures (like a linked list, immutable list, etc).

0 replies

DavidArno · 2017-10-24T20:22:11Z

DavidArno
Oct 24, 2017
Author

@orthoxerox and @eyalsk,

With regard to the question, "why would one pattern match a collection", I asked that same question a while back. Ironically @orthoxerox provided the answer, so I'm amused he's now asking why one would do it!

A specific example can be found here:

But let's say you have an IReadOnlyList<Token> where two consecutive newline tokens should be parsed as a statement separator. I don't think you can use LINQ in an idiomatic way here, this looks bizarre to me:

tokens = tokens.Zip(tokens.Skip(1).Concat(new [] { Token.Dummy }), (t1, t2) => IsNewline(t1) && IsNewline(t2) ? new Token(Kind.Separator) : t1);

Using pattern matching, this becomes:

IEnumerable<Token> ConvertPairedNewlinesToTokens(IEnumerable<Token> tokenSet)
{
    switch (tokenSet)
    {
        case var [t1, t2, .. tail] when IsNewline(t1) && IsNewline(t2) :
            yield return new Token(Kind.Separator);
            switch(tail);
        case var [t, .. tail] :
            yield return t;
            switch(tail);
        case [] :
            yield break;
    }
}

The syntax is longer^*, but to my mind at least, it's far simpler to follow.

^*I think the syntax could be improved hugely, but for now I'm sticking with C#-style syntax for this discussion to keep it looking familiar.

0 replies

DavidArno · 2017-10-24T20:30:54Z

DavidArno
Oct 24, 2017
Author

@svick,

I think that pattern matching should not enumerate a collection more than once ... Caching might be necessary to achieve that. In that case, the cache should be lazy and based on List<T>

I dismissed List<T> as I'd assumed it would be very inefficient. When the array buffer fills up, doesn't it create a new buffer and copy all the values over? I'm assuming I'm wrong here for you to be proposing it.

For the GetSequenceCounts example, I tried to think of how I could write it most clearly, including using some form of pattern matching. But I ended up with C# 7.0 code:

There's a bug in your code. If the first element of the sequence matches default(T), it will spit out an incorrect count. It's easily fixed by adding a firstPass flag, but it neatly highlights the complexity of expressing a fundamentally recursive problem with an iterative solution: all those extra edge cases and mutating variables to keep track of...

(and well spotted with regard to my naive use of == 😉)

0 replies

svick · 2017-10-24T20:53:23Z

svick
Oct 24, 2017
Collaborator

@DavidArno I think that that token parsing problem is indeed a good example of something that current C# doesn't handle well, while functional-style pattern matching does.

Trying to fit that into my "imperative pattern matching" idea, I think it could look something like:

IEnumerable<Token> ConvertPairedNewlinesToTokens(IEnumerable<Token> tokenSet)
{
    while (true)
    {
        switch (tokenSet)
        {
            case [var t1, var t2, .. tokenSet] when IsNewline(t1) && IsNewline(t2):
                yield return new Token(Kind.Separator);
                break;
            case [var t, .. tokenSet]:
                yield return t;
                break;
            case []:
                yield break;
        }
    }
}
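
For reference, this loop can be approximated with the list patterns that shipped in C# 11. It is a sketch under several assumptions: Token, Kind and IsNewline are hypothetical stand-ins for the types in the example, the input is a Token[] instead of an IEnumerable<Token>, and the shipped slice pattern cannot assign to an existing variable, so the tail is captured in a new variable and assigned by hand.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var tokens = new[]
{
    new Token(Kind.Word), new Token(Kind.Newline), new Token(Kind.Newline),
    new Token(Kind.Word), new Token(Kind.Newline),
};
Console.WriteLine(string.Join(" ", ConvertPairedNewlinesToTokens(tokens).Select(t => t.Kind)));
// Word Separator Word Newline

static bool IsNewline(Token t) => t.Kind == Kind.Newline;

static IEnumerable<Token> ConvertPairedNewlinesToTokens(Token[] tokenSet)
{
    while (true)
    {
        switch (tokenSet)
        {
            case [var t1, var t2, .. var rest] when IsNewline(t1) && IsNewline(t2):
                yield return new Token(Kind.Separator);
                tokenSet = rest;   // continue with everything after the pair
                break;
            case [var t, .. var rest]:
                yield return t;
                tokenSet = rest;   // continue with everything after the head
                break;
            default:               // empty (or null) input: nothing left to do
                yield break;
        }
    }
}

// Hypothetical token model, just enough to run the example.
enum Kind { Word, Newline, Separator }
record Token(Kind Kind);
```

Each slice of an array allocates a copy of the tail, so this version is quadratic in the number of tokens. It shows the shape of the code, not a recommended implementation.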

I also had a crazy idea: introduce a foreach switch:

IEnumerable<Token> ConvertPairedNewlinesToTokens(IEnumerable<Token> tokenSet)
{
    foreach switch (tokenSet)
    {
        case var [t1, t2] when IsNewline(t1) && IsNewline(t2):
            yield return new Token(Kind.Separator);
            break;
        case var [t]:
            yield return t;
            break;
    }
}

I'm not sure adding a foreach switch would actually make sense, but I do find the idea interesting.

0 replies

bondsbw · 2017-10-24T20:59:56Z

bondsbw
Oct 24, 2017

Could this syntax clash with attribute pattern matching in #807?

private bool Serializable { get; set; }

...

if (T is [Serializable]) ...

0 replies

svick · 2017-10-24T21:09:03Z

svick
Oct 24, 2017
Collaborator

@DavidArno

I dismissed List<T> as I'd assumed it would be very inefficient. When the array buffer fills up, doesn't it create a new buffer and copy all the values over? I'm assuming I'm wrong here for you to be proposing it.

Yes, but that copying happens infrequently; amortized time complexity of adding an item is still O(1). In practice, I believe the overheads of linked lists are much bigger, which is why real C# code uses List<T> very often, while LinkedList<T> is almost never used.
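
A quick way to see how rarely the copying happens is to count the capacity changes while adding items. This is only a sketch: the exact growth policy of List<T> is an implementation detail, so the number printed may differ between runtimes.

```csharp
using System;
using System.Collections.Generic;

var list = new List<int>();
var reallocations = 0;
var lastCapacity = list.Capacity;

for (var i = 0; i < 1_000_000; i++)
{
    list.Add(i);
    if (list.Capacity != lastCapacity)
    {
        reallocations++;              // the backing array was replaced and copied
        lastCapacity = list.Capacity;
    }
}

Console.WriteLine($"{list.Count} adds, {reallocations} reallocations");
```

With a capacity that grows geometrically, a million adds trigger only a few dozen copies at most, which is where the amortized O(1) comes from.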

There's a bug in your code. If the first element of the sequence matches default(T), it will spit out an incorrect count.

I don't think so. The count is initialized to 0, so if the first item matches default, the count is correctly incremented to 1. And GetSequenceCounts(new[] { 0, 0, 1, 1 }) correctly returns { (0, 2), (1, 2) } for me.

0 replies

DavidArno · 2017-10-25T13:25:59Z

DavidArno
Oct 25, 2017
Author

@svick,

Hmm, you are right. It does work. So we can conclude that it's actually me that has difficulty understanding code that solves problems with loops.

I love your idea of the foreach switch syntax, though you aren't helping my cause here by suggesting such neat ideas! 😀

@bondsbw,
Yes this proposal would clash with pattern matching attributes, which is annoying as I like that idea too. 😞

0 replies

Richiban · 2017-10-31T18:33:05Z

Richiban
Oct 31, 2017

@Opiumtm

I think it's OK to iterate over IEnumerable repeatedly.

Definitely not, I'm afraid. Since it may ultimately be generated by a yield return statement, an IEnumerable<T> is not necessarily an in-memory collection (no matter how much most developers treat it as such).

An IEnumerable<T> is just an object that will repeatedly spit out Ts until either it decides it's finished or you tell it to stop. There's no guarantee that it will ever stop or that, if you enumerate it again, you'll get the same items that you did before.

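using System;
using System.Collections;
using System.Collections.Generic;

// Yields ten fresh random numbers on every enumeration,
// so two passes over the same instance see different items.
public class EnigmaticObject : IEnumerable<int>
{
	private static readonly Random Random = new Random();
	
	public IEnumerator<int> GetEnumerator()
	{
		var count = 10;

		while (count-- > 0)
		{
			yield return Random.Next();
		}
	}
	
	IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();
}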
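
The same point can be shown deterministically. In this sketch (the names Numbers and runs are made up for illustration) the iterator body runs again from the top each time the sequence is enumerated, so any side effect or expensive work is repeated.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var runs = 0;
IEnumerable<int> Numbers()
{
    runs++;               // executes every time an enumeration starts
    yield return 1;
    yield return 2;
}

var numbers = Numbers();  // nothing has run yet: iterators are lazy
var count = numbers.Count();   // first enumeration
var first = numbers.First();   // second enumeration

Console.WriteLine($"count={count}, first={first}, runs={runs}"); // count=2, first=1, runs=2
```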

0 replies

Richiban · 2017-11-01T10:14:26Z

Richiban
Nov 1, 2017

I do, however, support the idea of pattern matching on collections. Maybe only on persistent or recursive data structures?

I'm thinking that there are many types that can be destructured into a tuple, such as a cons-style list that could be split into its head and tail:

var (hd, tl) = list;

So maybe

var [x, y, z] = a

can be shorthand for the recursive pattern:

(var x, (var y, (var z, _))) = a

I'm just brainstorming here, but it seems possible, if a type can be split into two, to be able to recursively split it further.
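
A rough sketch of what that looks like with features C# already has, assuming a hypothetical Cons<T> record as the cons-style list (a positional record gets a Deconstruct method for free, and null stands in for the empty list):

```csharp
#nullable enable
using System;

var a = new Cons<int>(1, new Cons<int>(2, new Cons<int>(3, null)));

// Deconstruction into head and tail, as in `var (hd, tl) = list;`
var (hd, tl) = a;
Console.WriteLine(hd);                       // 1

// The nested form that `var [x, y, z] = a` could be shorthand for.
// Written as a pattern, it simply fails to match when the list is too short.
if (a is (var x, (var y, (var z, _))))
    Console.WriteLine($"{x} {y} {z}");       // 1 2 3

// Hypothetical cons cell: Head is the first element, Tail the rest (null when empty).
record Cons<T>(T Head, Cons<T>? Tail);
```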

This could also work with Array Slices, when they arrive.

0 replies

bondsbw · 2017-11-01T15:22:48Z

bondsbw
Nov 1, 2017

@Richiban Is list deconstruction into tuples part of an existing proposal?

0 replies

alrz · 2020-02-29T18:18:42Z

alrz
Feb 29, 2020

Since all the required infra is being done as part of C# 9.0, I've started to work on a proposal for this feature, this is the latest revision:

https://gist.github.com/alrz/84addd150849a0b8c014deb85b75211d/

I plan to implement a prototype for the first two sections with the hopes that it'll get championed. Feedback welcome.

2 replies

julealgon Nov 5, 2020

Link appears to be broken. Do you have any updates?

alrz Nov 5, 2020

@julealgon Championed at #3435

jcouv · 2021-09-18T05:31:11Z

jcouv
Sep 18, 2021
Collaborator

@DavidArno I've just noticed this discussion. I think the latest iterations of list-patterns are getting pretty close to what you're proposing.

I've written a proposal for how we might implement list-patterns on enumerable types (i.e. that are not indexable) without enumerating multiple times: https://github.com/dotnet/csharplang/blob/main/proposals/list-patterns-enumerables.md
In short, we'd generate a wrapper type around the enumerable and buffer a few elements. More elements need to be buffered if the list-patterns are longer and involve slices at the start. This type provides a count-in-progress and emulates indexers for start-indexes (such as 0) and end-indexes (such as ^1).

Regarding the head/tail pattern, I don't think we need to use Deconstruct(head, tail). The basic indexing and slice mechanism should suffice (indexing for the head and slice for the tail).
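
To illustrate the buffering idea for end-indexes, here is a small sketch. It is not the wrapper type from the proposal: EndsWith is a hypothetical helper that answers a `[.., a, b]`-style question in one pass over the enumerable while holding only as many elements as the pattern mentions.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

Console.WriteLine(EndsWith(Enumerable.Range(1, 100), 99, 100)); // True
Console.WriteLine(EndsWith(Enumerable.Range(1, 100), 1, 2));    // False

// Hypothetical helper: single pass, keeping only the last
// suffix.Length elements of the source in memory.
static bool EndsWith<T>(IEnumerable<T> source, params T[] suffix)
{
    var window = new Queue<T>(suffix.Length + 1);
    foreach (var item in source)
    {
        window.Enqueue(item);
        if (window.Count > suffix.Length)
            window.Dequeue();            // drop the oldest buffered element
    }
    return window.Count == suffix.Length && window.SequenceEqual(suffix);
}
```

A pattern that also inspects the start of the sequence would need a second small buffer for the leading elements, which is roughly why longer patterns need more buffering.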

2 replies

DavidArno Sep 20, 2021
Author

@jcouv, I'd forgotten about this, to be honest. I agree with you that the latest proposals are indeed pretty close to this, so it makes sense to mark this as answered. Thanks.

alrz Sep 23, 2021

I think Deconstruct(head, tail) is actually a perfect solution if the data structure is formulated as head+tail, so that the match directly applies to the type's internals. Implementing Slice for such a type will be more involved than just Deconstruct(head, tail).
