Is concatenation going to be written + or something else? #457

Closed
josh11b opened this issue Apr 13, 2021 · 14 comments
Labels
leads question A question for the leads team

Comments

@josh11b
Contributor

josh11b commented Apr 13, 2021

For example, will you concatenate strings with + or another operator? Context: https://en.wikipedia.org/wiki/Concatenation

Advantage of +:

  • Matches C++, Python.
  • Don't have to allocate an extra symbol.

Advantage of something else:

  • Concatenation is not commutative, so + is unnatural mathematically.
  • Would allow us to distinguish vector concatenation from vector addition.
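
As a rough illustration of the last point, here is a minimal Python sketch. Python is used only for illustration, and `add_elementwise` is a hypothetical helper, not something proposed in this issue. Python's built-in lists spell concatenation as `+`, so element-wise addition has to be spelled some other way:

```python
# Python lists use `+` for concatenation, so vector addition
# needs a different spelling.
a = [1, 2, 3]
b = [10, 20, 30]

concatenated = a + b  # six elements: a followed by b


def add_elementwise(x, y):
    """Hypothetical helper: vector addition of two equal-length lists."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return [xi + yi for xi, yi in zip(x, y)]


summed = add_elementwise(a, b)  # three elements

print(concatenated)  # [1, 2, 3, 10, 20, 30]
print(summed)        # [11, 22, 33]
```

With a dedicated concatenation operator, both operations could have an operator spelling on the same type.
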
@zygoloid
Contributor

Even if there were no types that support both addition and concatenation, using distinct operators for addition versus concatenation seems reasonable for expressivity purposes. And I think concatenation is an operation that's common enough that allocating an extra symbol for it would be reasonable; I don't consider allocating an extra symbol to be a significant disadvantage. However, it does seem to create a potential interoperability friction with C++. When calling Carbon from C++, would we map a C++ + to both addition and concatenation operators and see which one works? When calling C++ from Carbon, would we map a Carbon + and a Carbon concatenation operator to the single C++ + operator? Or maybe we could add a source-level annotation on the C++ side to say which kind of plus an operator+ provides?

If we are able to support a single operator token being all of prefix, infix, and postfix, we could use an infix ++ for this purpose (not to avoid allocating an operator, but simply because it seems to have familiar "additive" connotations while not being +); this symbol is used for concatenation in some functional programming languages.

Concatenation is not commutative, so + is unnatural mathematically.

I think there's a broader question here: do we want to have some notion of semantics associated with our (overloadable) operators, or is it just a free-for-all? For example, should we be able to assume that + and * form a ring, and perhaps even optimize accordingly?

I think we need to be careful here, and consider how we would fit floating-point types and integer overflow into such a model. Commutativity of + is rare in being a mathematical property that actually holds for integer and floating-point types (at least, if we view the choice of which payload to propagate in NaN + NaN as being nondeterministic). I wonder how much we should worry about ensuring this property holds, if we can't do that for any other mathematical property. Perhaps we can still provide certain guidelines about how overloads should work, even if they are not enforceable or even strictly true in practice -- it seems helpful to a reader to be assured that + does something plusish -- but that sounds like a style guide rule rather than a language rule. Nonetheless if we have a separate concatenation operator I would expect people to use it rather than overloading + to work on strings.
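
The floating-point point can be checked directly. The following is a small Python sketch (Python floats are IEEE 754 doubles on common platforms; the specific values are chosen for illustration only). It shows that `+` commutes for these values while associativity fails:

```python
import math

a, b, c = 0.1, 0.2, 0.3

# Swapping the operands gives the same result.
commutes = (a + b) == (b + a)

# Regrouping the operands changes the rounding, so associativity fails.
left = (a + b) + c
right = a + (b + c)
associates = left == right

# NaN never compares equal to itself, so test for NaN-ness instead.
nan_sum = math.nan + math.nan

print(commutes, associates, math.isnan(nan_sum))  # True False True
```
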

@tkoeppe
Contributor

tkoeppe commented Apr 13, 2021

Concatenation is not commutative, so + is unnatural mathematically.

I think there's a broader question here: do we want to have some notion of semantics associated with our (overloadable) operators, or is it just a free-for-all? For example, should we be able to assume that + and * form a ring, and perhaps even optimize accordingly?

This already assumes that these binary operators are homogeneous (T, T) -> T, and precludes simple affine constructions like pointer and iterator arithmetic. Is that a restriction we would want?

@geoffromer
Contributor

I think that the canonical syntax for concatenating N strings in Carbon should have optimal performance in at least the following senses:

  • It copies each byte of string data exactly once.
  • It does not copy or move any string objects.
  • It performs at most one heap allocation.

It's far from clear that an infix binary operator will be able to satisfy these requirements, regardless of how we spell it; as far as I can tell, the only way to do so is for the language to provide some way of transforming a chain of operators into the equivalent of a single function call. That could take the form of something like expression templates, or some sort of bespoke language-level rewrite rule as discussed in #451, or maybe some kind of static reflection, or who knows what else.

On the other hand, I am very confident that a function call syntax will be able to satisfy these requirements in Carbon, because it can already do so pretty straightforwardly in C++ (see absl::StrCat for an example), so a function-call syntax seems like the option that will impose the least burden on the design of the language. I think it will also impose the least burden on the reader, because the syntax directly corresponds to the semantics, without an intervening transformation step, and because we can use a meaningful name rather than repurposing or inventing some punctuation mark.
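
To make the contrast concrete, here is a Python sketch of the two shapes. Both `str_cat` and `chained_concat` are hypothetical names written for this illustration; this is not how absl::StrCat is implemented. Python does not expose allocation or copying directly, so the comments describe the intended model rather than a measured guarantee:

```python
def str_cat(*pieces: bytes) -> bytearray:
    """Hypothetical n-ary concatenation: size first, then fill one buffer."""
    total = sum(len(p) for p in pieces)  # total size is known up front
    out = bytearray(total)               # one buffer for the whole result
    pos = 0
    for p in pieces:
        out[pos:pos + len(p)] = p        # each input byte is written once
        pos += len(p)
    return out


def chained_concat(*pieces: bytes) -> bytes:
    """Model of a left-to-right chain of binary concatenations."""
    result = b""
    for p in pieces:
        result = result + p              # builds a new, longer value each step
    return result


print(bytes(str_cat(b"ab", b"", b"cde")))    # b'abcde'
print(chained_concat(b"ab", b"", b"cde"))    # b'abcde'
```

The function-call form sees all operands at once, which is what lets it size the result before writing anything. A chain of binary operators only ever sees two operands at a time unless the language rewrites the chain.
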

@jonmeow
Contributor

jonmeow commented Apr 15, 2021

I think it's worth covering cross-language precedent a bit more if breaking away from C++ syntax, particularly for string concatenation. Going through the top 20 at https://pypl.github.io/PYPL.html, comparing with https://rosettacode.org/wiki/String_concatenation:

@zygoloid
Contributor

There is a collection of operations in this space, including at least these:

  • Given a format string and some arguments, insert the arguments into the string ("interpolation").
  • Given a desired output location (which might be a string or a file or a device or similar), incrementally feed it strings that are appended to that location ("streaming").
  • Given two strings, form a third string that is the concatenation of those two as efficiently as possible ("concatenation").
  • Given only a list of arguments, format them in a canonical way and append them, as if formatting with a format string "%0%1%2..." or similar ("StrCat"). Note that this is a special case of both interpolation and concatenation.
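
For readers who want the four operations side by side, here is a short Python sketch. Python is used only for illustration; `str_cat` is a hypothetical helper modeled on the "StrCat" bullet, and the variable names are made up:

```python
import io

name, count = "widget", 3

# Interpolation: arguments are inserted into a format string.
interpolated = "{0} x{1}".format(name, count)

# Streaming: pieces are fed incrementally to an output location.
sink = io.StringIO()
sink.write(name)
sink.write(" x")
sink.write(str(count))
streamed = sink.getvalue()

# Concatenation: two strings form a third string.
concatenated = name + " x"


# StrCat: format each argument canonically and append them all.
def str_cat(*args):
    """Hypothetical helper: canonical formatting plus n-ary append."""
    return "".join(str(a) for a in args)


joined = str_cat(name, " x", count)

print(interpolated, streamed, joined)  # widget x3 widget x3 widget x3
```
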

We presumably want some combination of these operations to be available, but not necessarily all of them. What use cases do we want to address with concatenation in particular rather than one of the other operations?

Given @geoffromer's comment and Carbon's efficiency goal, we should consider eschewing concatenation in favor of other options.

@geoffromer
Contributor

  • Given two strings, form a third string that is the concatenation of those two as efficiently as possible ("concatenation").

I suggest calling this "binary concatenation"; in my experience the term "concatenation" is often applied to APIs taking an arbitrary number of operands (e.g. StrCat, the unix cat command, etc), so trying to use it in this more restrictive sense seems likely to cause confusion.

  • Given only a list of arguments, format them in a canonical way and append them, as if formatting with a format string "%0%1%2..." or similar ("StrCat"). Note that this is a special case of both interpolation and concatenation.

I think that's a generalization of binary concatenation rather than a special case, for two reasons. The boring reason is that "an arbitrary number" is a generalization rather than special case of "two", but the interesting reason is the addition of "format them in a canonical way": as defined, this operation can take any operand that has a canonical format-as-string operation, whereas concatenation was defined to take only string operands.

Interpolation and streaming can likewise be generalized to support arbitrary string-formattable types (e.g. printf and <<, respectively). In principle binary concatenation could be generalized in that way, but I've never seen that done in practice, possibly because such a generalized binary concatenation operation can't be spelled as infix + (or any other overloaded spelling), because that would lead to ambiguity about e.g. whether 1 + 2 is 3 or "12".

I should note that my use of StrCat as an example wasn't intended to focus on the fact that it supports non-string types, and in fact I'm somewhat skeptical of generalizing concatenation (or streaming) in that way. But to the extent that we do want to generalize concatenation in that way, that's another reason to avoid using a binary operator syntax for it.

@zygoloid
Contributor

zygoloid commented May 17, 2021

Interpolation and streaming can likewise be generalized to support arbitrary string-formattable types [...]. In principle binary concatenation could be generalized in that way, but I've never seen that done in practice

This is commonplace in dynamically-typed languages -- for example, JavaScript and Perl both do this. VBScript does too, but uses & as the formatting binary concatenation operator rather than +, so e.g. 1 + 2 is 3 but 1 & 2 is "12" (though "1" + "2" is also "12", so it's not the case that + is only a numeric operation). It seems like a desirable operation in at least some domains, but I share your inclination to avoid using a binary operator for this purpose. But perhaps that doesn't strongly inform the question of whether we should expose a non-formatting binary concatenation operator as +.
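
The VBScript-style split can be sketched in Python, assuming a hypothetical `fmt_concat` function standing in for the `&` operator. Python itself overloads `+` for strings but refuses to mix `int` and `str`:

```python
def fmt_concat(left, right):
    """Hypothetical formatting concatenation, analogous to VBScript's `&`."""
    return str(left) + str(right)


numeric = 1 + 2               # addition
formatted = fmt_concat(1, 2)  # formats both operands, then appends
strings = "1" + "2"           # Python's `+` also concatenates strings

# Mixing the two under `+` is rejected rather than guessed at.
try:
    1 + "2"
    mixed_allowed = True
except TypeError:
    mixed_allowed = False

print(numeric, formatted, strings, mixed_allowed)  # 3 12 12 False
```
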

@github-actions github-actions bot added the inactive Issues and PRs which have been inactive for at least 90 days. label Aug 16, 2021
@lexi-nadia

lexi-nadia commented Jul 26, 2022

This already assumes that these binary operators are homogeneous (T, T) -> T, and precludes simple affine constructions like pointer and iterator arithmetic. Is that a restriction we would want?

Context: https://en.m.wikipedia.org/wiki/Level_of_measurement

I think the examples you bring up, along with others (timestamps; temperatures in °C or °F), all fit vaguely under the "interval scales" category -- they're references with no magnitude.

For an interval type T and a corresponding offset type O, we need to support the following options:

  1. Offset: T ± O -> T
  2. Difference: T - T -> O

Hypothetically, we could have different spellings for these operations, but it's hard to imagine this being a problem in practice. The types are different. (I might even suggest that we use different interfaces to represent these operations, even if they map to the same operator tokens.)
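
Python's standard `datetime` module happens to follow this pattern, with `datetime` in the role of T and `timedelta` in the role of O. A brief sketch (the specific dates are arbitrary):

```python
from datetime import datetime, timedelta

start = datetime(2022, 7, 26, 12, 0)  # interval type T
offset = timedelta(hours=3)           # offset type O

later = start + offset    # Offset: T + O -> T
earlier = start - offset  # Offset: T - O -> T
gap = later - earlier     # Difference: T - T -> O

# T + T has no meaning for an interval type, and Python rejects it.
try:
    start + later
    sum_allowed = True
except TypeError:
    sum_allowed = False

print(later, gap, sum_allowed)  # 2022-07-26 15:00:00 6:00:00 False
```
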

Concatenation is a very different case for me. Rather than defining new operations, it reuses T + T -> T in a way that breaks commutativity. Multiplication barely makes sense here, and subtraction and division make no sense at all. (And there are cases, like vectors, where both addition and concatenation could make sense!) That's why i'm so uneasy about using + for this case; it's just not addition-like at all.

@lexi-nadia

(As an aside, i think vector concatenation may be a better motivating example than string concatenation.)

@github-actions github-actions bot removed the inactive Issues and PRs which have been inactive for at least 90 days. label Jul 26, 2022
@jonmeow jonmeow added the leads question A question for the leads team label Aug 10, 2022
@mossaiby

Let me add that in Julia, the string concatenation is performed using *. It was odd for me at first, but became natural when I thought of it as in math; a * b = ab, hence 'a' * 'b' = 'ab'. Hope it helps.

@chandlerc
Contributor

Fundamentally, I think Carbon should consider the + operator symbol to represent some kind of "add" operation where "add" is an abbreviation for addition. The language should not decide that + can also mean concatenation, it should only consider it addition.

Whether it makes sense to use that language concept of addition for a type to mean concatenation is a question for the author of that type.

I can imagine types which have really good reasons that the only possible and useful model of addition on the type is concatenation. But I can also imagine many, many types for which that is not the case. @lexi-nadia gives a great example of vectors.

I don't think we can hope for Carbon to have the language indicate one way or another here, at least not at this stage.

This in turn raises a few tightly related questions we need to answer to close this out:


First: should we add a new operator symbol to Carbon so that it could be a language-level symbol for concatenation?

I suggest that we do not do this (for now). I think there are a lot of healthy ideas for how to make even types where concatenation is unambiguously not well aliased to addition reasonably ergonomic.

We can revisit this in the future if we get substantial information indicating that many types would benefit from this expansion of our operator set. But operators are a reasonably expensive syntactic space to begin with, and I think at least today I don't see nearly enough motivation. I would much prefer to invest in other syntax tools that address the same or similar use cases.


Second: should strings in Carbon use addition to mean concatenation?

I somewhat strongly think this is not the right direction.

There are many challenges with this model raised in this thread already. I'll add one more that for me is particularly important: strings are especially likely to be used accidentally in place of some other type. There is even the joking phrase "stringly typed APIs" because strings get (over)used so often when there is notionally some other typed data serialized within the string.

Because of this frequent type confusion, I think we should be especially careful in using expression syntaxes with strings that might be surprising to apply to a string. I worry that they will indeed be surprising, and that the code will become harder to read as the reader reasonably assumes the wrong type rather than deducing an unconventional operation.
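
A small Python sketch of the hazard, in a language where strings do overload `+` (the function `total` and its inputs are hypothetical):

```python
def total(a, b):
    """Hypothetical helper whose author expected numeric arguments."""
    return a + b


from_numbers = total(1, 2)      # numeric addition
from_strings = total("1", "2")  # silently concatenates instead

print(from_numbers)  # 3
print(from_strings)  # 12
```

Nothing at the call site or in the body signals that the second call performs a different operation; a reader has to know the argument types to tell.
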

Combining this with the other issues, I think we should focus on techniques like string interpolation, APIs like StrCat, etc. These will still give us good migration strategies for C++ code that uses string addition, and will IMO at least result in more readable code.


Aside: I'd prefer to not anchor this around commutativity FWIW. I think that we should start without assuming commutativity for operators, even where it is extremely common. For example, if we assume + is commutative I think it will be surprising that we don't assume * is commutative. But * has many more cases where this isn't appropriate. If we want to add the ability to reason about commutativity of operators, I think we should do so in a way that can be controlled by the types in question so that both commutative and non-commutative cases can be supported for the same syntax, and so that we can use the tools for any particular operator.


@zygoloid
Contributor

Leads decision: we follow @chandlerc's most recent comment. + is overloadable but we intend for it to mean "add" not "concatenate", and Carbon's string type will use some other mechanism for string concatenation.
