Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: selective aggregates #3506

Closed
likg227 opened this issue Jun 27, 2022 · 1 comment
Closed

RFC: selective aggregates #3506

likg227 opened this issue Jun 27, 2022 · 1 comment
Assignees

Comments

@likg227
Copy link
Contributor

likg227 commented Jun 27, 2022

Background

This is a useful feature according to some articles. This feature is supported by PostgreSQL and we can use it to implement distinct aggregation.

Design

Syntax: AGG(<expression>) FILTER(WHERE <condition>), e.g. sum(v1) FILTER(WHERE v1 < 5).

The FILTER clause extends aggregate functions (sum, avg, count, …) by an additional WHERE clause. The result of the aggregate is built from only the rows that satisfy the additional WHERE clause too.

In frontend, we can add Condition for PlanAggCall . As for the backend, is it easy to implement this feature? cc @StrikeW

Discussions

An alternative approach is using CASE .. WHEN .. to wrap aggregate's argument. Generally, we just need to use CASE .. WHEN .. to wrap the arguments and compute it in LogicalProject below LogicalAgg. However, we have to insert an extra LogicalProject when we want use this alternative approach to implement distinct aggregation, which might bring more performance cost.

In conclusion, selective aggregates are clearer and more convenient.

Reference

PostgreSQL document

Modern SQL

@fuyufjh
Copy link
Member

fuyufjh commented Jun 28, 2022

+1 for this.

By the way,

An alternative approach is using CASE .. WHEN .. to wrap aggregate's argument.

The "alternative approach" doesn't seem to work perfectly. CASE WHEN does not provide the filtering semantic exactly. Assume a Project with CASE WHEN was added under Agg operator, it must still put something (i.e. generated by the then branch) to the aggregation function. The best choice here must be a NULL because almost all agg functions simply ignore NULLs in their implementation, that is: AGG({..., NULLs}) == AGG({...}), but this property may be broken for some corner cases or UDAF.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants