Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: support CORRESPONDING specification in set operations #17891

Merged
merged 12 commits into from
Aug 7, 2024

Conversation

yuhao-su
Copy link
Contributor

@yuhao-su yuhao-su commented Aug 1, 2024

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

support corresponding spec in set operation

#17451

Checklist

  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
  • I have added test labels as necessary. See details.
  • I have added fuzzing tests or opened an issue to track them. (Optional, recommended for new SQL features Sqlsmith: Sql feature generation #7934).
  • My PR contains breaking changes. (If it deprecates some features, please create a tracking issue to remove them in the future).
  • All checks passed in ./risedev check (or alias, ./risedev c)
  • My PR changes performance-critical code. (Please run macro/micro-benchmarks and show the results.)
  • My PR contains critical fixes that are necessary to be merged into the latest release. (Please check out the details)

Documentation

  • My PR needs documentation updates. (Please use the Release note section below to summarize the impact on users)

set operation SQL syntax query1 UNION/EXPECT/INTERSECT query2 requires query1 and query2 return the same number of columns and match the columns in left-to-right order (by column index).

With corresponding specification, the set operation matches columns by the column names. Only columns that are in both side will be overlayed. It will suppress columns that are not in both side of the set operation. Columns are considered in both side if they have the same name or alias

You can specify the columns need to be overlayed in the set operation by putting the column names in <corresponding column list>. Only columns that are in both side and the corresponding column list will be overlayed.

<corresponding spec> can be used after UNION/EXPECT/INTERSECT [ALL]
<corresponding spec> ::= CORRESPONDING [ BY <left paren> <corresponding column list> <right parent> ]

Release note

If this PR includes changes that directly affect users or other significant modifications relevant to the community, kindly draft a release note to provide a concise summary of these changes. Please prioritize highlighting the impact these changes will have on users.

@yuhao-su yuhao-su added the user-facing-changes Contains changes that are visible to users label Aug 1, 2024
@graphite-app graphite-app bot requested a review from a team August 1, 2024 09:46
Comment on lines 284 to 285
"At least one column of the left side shall have a <column name> that is the \
<column name> of some column of the right side"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"At least one column of the left side shall have a <column name> that is the \
<column name> of some column of the right side"
"When using Corresponding keywork, at least one column of the left side shall have a <column name> that is the \
<column name> of some column of the right side"

@xxchan
Copy link
Member

xxchan commented Aug 6, 2024

support corresponding spec in set operation

What does "spec" mean here?

@xxchan xxchan changed the title feat: set operation corresponding spec feat: support CORRESPONDING clause in set operations Aug 6, 2024
@yuhao-su
Copy link
Contributor Author

yuhao-su commented Aug 6, 2024

support corresponding spec in set operation

What does "spec" mean here?

https://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt

CORRESPONDING is not a clause. It's a specification.

for (idx, field) in set_expr.schema().fields.iter().enumerate() {
if name2idx.insert(field.name.clone(), idx).is_some() {
return Err(ErrorCode::InvalidInputSyntax(format!(
"duplicated column name `{}` in a select list of a set operation",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better to indicate which side of which operation, instead of "a select list of a set operation"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Besides, better to add a test case for this error. I didn't see one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There can be more than 2 side. If we have a A union corresponding B union corresponding C. It can be confusing to if B is a left or right side

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure. Would that still be better than nothing, since we can exclude C? 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's useful for debugging. But I'm just worried users will be confused about which one is left when they see three unioned query.

I do think printing the whole column list in error msg makes sense know.

Comment on lines 261 to 262
"Every <column name> in the <corresponding column list> shall be a \
<column name> of both left and right side. Missing column: `{}`",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. "corresponding column list" seems a little verbose. "corresponding clause"?
  2. Why using <>? This seems unnecessary and inconsistent with our other error messages.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And better to indicate which side is missing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And better to indicate which side is missing.

Ditto

Comment on lines 284 to 285
"When CORRESPONDING is specified, at least one column of the left side shall have a <column name> that is the \
<column name> of some column of the right side"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean

Suggested change
"When CORRESPONDING is specified, at least one column of the left side shall have a <column name> that is the \
<column name> of some column of the right side"
"When CORRESPONDING is specified, left and right side should have at least one common column name"

?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left or right side does not have column name, the columns in left or right have column names.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why using <>?

I can remove this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left or right side does not have column name, the columns in left or right have column names.

Sorry, I do not understand. Can you explain to me like I'm a normal user?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left or right side are querys. A query does not have column name. The columns/select items in a query have column name.

left and right side should have at least one common column name

This can be rephrased to left and right side should have at least one column with identical column names

The original one I use is from the SQL 92 doc

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A query does not have column name. The columns/select items in a query have column name.

Well, this sounds too language lawyer to me.. 🤡
I think "A query's column name" is clear enough to me. Do you really think its vague? 🤔 (Specification needs to be more accurate though)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you really think its vague?

I can definitely understand that expression perfectly. But using more accurate language can avoid ambiguity for users, especially CORRESPONDING is not a wildly available feature is other DB.

@xxchan
Copy link
Member

xxchan commented Aug 6, 2024

I doubt the release note is not understandable. Could you try to improve it using clearer language?

I can imagine some possible questions asked by users:

  • Does CORRESPONDING BY need all column match? What's the behavior for columns not included?
  • What is CORRESPONDING's behavior?

Besides, the error messages can be

@st1page
Copy link
Contributor

st1page commented Aug 6, 2024

I doubt the release note is not understandable. Could you try to improve it using clearer language?

I can imagine some possible questions asked by users:

  • Does CORRESPONDING BY need all column match? What's the behavior for columns not included?
  • What is CORRESPONDING's behavior?

Besides, the error messages can be

Thanks for the reviewing for the doc part.

@yuhao-su yuhao-su changed the title feat: support CORRESPONDING clause in set operations feat: support CORRESPONDING specification in set operations Aug 6, 2024
@yuhao-su
Copy link
Contributor Author

yuhao-su commented Aug 6, 2024

indicate which side of which operation

Do you think print the column name list in the error msg can help? @xxchan

@xxchan
Copy link
Member

xxchan commented Aug 6, 2024

indicate which side of which operation

Do you think print the column name list in the error msg can help? @xxchan

Well, it depends.. You can decide yourself according to the principles. (Make it easier for users to locate and fix the problem.) Or give up if you don't find any good ways.

@xxchan
Copy link
Member

xxchan commented Aug 6, 2024

BTW, just searched a little bit why PG doesn't have this, and found a long thread about adding the feature and it was abandoned due to complexity: https://www.postgresql.org/message-id/flat/CAFj8pRCspwFhF1JVD-GKJS0rsojBC1r6okFmDtfVKiSu2XwD0g%40mail.gmail.com#f19dd33222923fc2310b4bd2c1b15c25

I worked on this for awhile but eventually decided that it's not very
close to being committable. The main thing that's scaring me off is
a realization that there are a lot of places that assume that the
output columns of a set operation are one-for-one with the columns of
the inputs.

Not sure whether we have such kind of problems.

@xxchan
Copy link
Member

xxchan commented Aug 6, 2024

Not sure whether we have such kind of problems.

I guess not, because we just inserted project.

And I found an interesting sentence in the discussion:

It probably does make sense for the actual implementation to involve
inserting projection nodes below the setop. I'm pretty sure the executor
code for setops expects to just pass input tuples through without
projection. You could imagine changing that, but it would add a penalty
for non-CORRESPONDING setops, which IMO would be the wrong tradeoff.

https://arc.net/l/quote/elyhevcv

So I guess we won't have problems since our executors don't have such level of optimizations. 🤣

@graphite-app graphite-app bot requested a review from a team August 6, 2024 15:55
@yuhao-su yuhao-su requested a review from xxchan August 7, 2024 03:18
db error: ERROR: Failed to run the query

Caused by:
Invalid input syntax: When CORRESPONDING is specified, at least one column of the left side shall have a column name that is the column name of some column of the right side in a UNION operation. Left side query column list: ("v1", "v2", "v4"). Right side query column list: ("vxx").
Copy link
Contributor Author

@yuhao-su yuhao-su Aug 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@xxchan should remove the column list of this too?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can decide it :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll keep this considering it can be fairly difficult to identify the faulty query without any information.

@yuhao-su yuhao-su requested a review from xxchan August 7, 2024 04:09
@yuhao-su yuhao-su added this pull request to the merge queue Aug 7, 2024
Merged via the queue into main with commit fb62114 Aug 7, 2024
29 of 30 checks passed
@yuhao-su yuhao-su deleted the yuhao/corresponding branch August 7, 2024 12:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type/feature user-facing-changes Contains changes that are visible to users
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants