Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Got rid of redundant JOINs/WHEREs of WhereInWalker #10917

Open
wants to merge 8 commits into
base: 3.3.x
Choose a base branch
from

Conversation

SerheyDolgushev
Copy link
Contributor

@SerheyDolgushev SerheyDolgushev commented Aug 23, 2023

Doctrine\ORM\Tools\Pagination\Paginator is an amazing tool that lets you fetch entities in chunks. And it sends two separate SQL queries:

  1. The first query to get unique entity IDs
  2. The second query is fetching the actual entity data. This query uses the result of the previous one as an additional filter.
    For simplicity let's call them id and data queries.

To demonstrate it, lets use CmsUser and CmsAddress entities:

...
$query = new Query($this->em);
$query->setDQL('SELECT u
    FROM Doctrine\\Tests\\Models\\CMS\\CmsUser u
    JOIN Doctrine\\Tests\\Models\\CMS\\CmsAddress a
    WHERE a.city = :filterCity');
$query->setParameters(['filterCity' => 'London']);
$query->setMaxResults(1);
$paginator = (new Paginator($query, true))->setUseOutputWalkers(false);
$paginator->getIterator();

The code above will trigger two SQL queries:
id

SELECT DISTINCT c0_.id AS id_0
FROM cms_users c0_
INNER JOIN cms_addresses c1_
WHERE c1_.city = ? LIMIT 1

data

SELECT
    c0_.id AS id_0,
    c0_.status AS status_1,
    c0_.username AS username_2,
    c0_.name AS name_3,
    c0_.email_id AS email_id_4
FROM cms_users c0_
INNER JOIN cms_addresses c1_
WHERE 
    c1_.city = ?
    AND c0_.id IN (?)

The problem is that the data query has:

  • unused join: INNER JOIN cms_addresses c1_
  • redundant where condition c1_.city = ?
    This PR tries to optimize the data query.

Basically, this PR does the following:

  1. Removes all WHERE conditions for data query, except the condition that comes from the id query results. Please note, that this change is not applied to the queries with subselect. Because in this case subselect part might use some params, so we can not skip the original params in the data query. It should be possible to optimize subselect cases in the future, but it would require a bit more sophisticated solution: for data query remove only original params that are not used in the subquery (instead of skipping all).
  2. All JOINs are skipped for the data query if not any join is used in SELECT, GROUP BY, or ORDER BY clauses. It is done only if the HAVING clause is not specified (simplicity is the main reason, and it can be optimized in the future).

I want to clarify that the EasyAdmin createIndexQueryBuilder triggered this PR, and all the index actions in EasyAdmin should benefit from it.

@SerheyDolgushev SerheyDolgushev force-pushed the fix/paginator-skip-redunand-joins branch from e5d9d71 to e60ba0f Compare August 23, 2023 16:49
@SerheyDolgushev SerheyDolgushev force-pushed the fix/paginator-skip-redunand-joins branch from e60ba0f to 2d74ff0 Compare August 23, 2023 16:51
@derrabus derrabus changed the base branch from 2.16.x to 2.17.x August 23, 2023 21:46
@derrabus derrabus changed the base branch from 2.17.x to 2.16.x August 23, 2023 21:46
@derrabus derrabus changed the base branch from 2.16.x to 2.17.x August 23, 2023 21:48
@derrabus
Copy link
Member

Please explain what problem exactly you're attempting to solve with your change. And please have a look at the failing tests.

@SerheyDolgushev
Copy link
Contributor Author

@derrabus please check the updated PR description, and also I pushed some code changes that fix coding standards/test and added testRedunandQueryPartsAreRemovedForWhereInWalker test which demonstrates this PR changes.

@cjunge
Copy link

cjunge commented Sep 11, 2023

@SerheyDolgushev do you have any benchmarks to show the performance improvement from this PR? I'm interested to see if there is any real-world benefit besides tweaking the generated query.
My concern is that it's introducing more complexity to the query building without any/much gain in performance.

@SerheyDolgushev
Copy link
Contributor Author

SerheyDolgushev commented Sep 13, 2023

@SerheyDolgushev do you have any benchmarks to show the performance improvement from this PR? I'm interested to see if there is any real-world benefit besides tweaking the generated query. My concern is that it's introducing more complexity to the query building without any/much gain in performance.

@cjunge I see your concerns, but performance improvement depends on the actual query. The more joins and conditions the original query has, the more its performance will improve with this change.

I just used a simple user/addresses example to illustrate my point:

CREATE TABLE cms_users (
  id BIGINT UNSIGNED AUTO_INCREMENT NOT NULL,
  name VARCHAR(32) NOT NULL,
  PRIMARY KEY(id)
);

CREATE TABLE cms_addresses (
  id BIGINT UNSIGNED AUTO_INCREMENT NOT NULL,
  user_id BIGINT UNSIGNED NOT NULL,
  country CHAR(2) NOT NULL,
  city VARCHAR(32) NOT NULL,
  street VARCHAR(32) NOT NULL,
  PRIMARY KEY(id)
);
ALTER TABLE cms_addresses ADD CONSTRAINT FK_user FOREIGN KEY (user_id) REFERENCES cms_users (id);
	
INSERT INTO cms_users (`name`)
VALUES ('John'), ('Nicole'), ('David');

INSERT INTO cms_addresses (`user_id`, `country`, `city`, `street`)
VALUES
	(1, 'UK', 'London', 'Street #1'),
	(1, 'US', 'Tampa', 'Street #2'),
	(2, 'FR', 'Paris', 'Street #3'),
	(3, 'US', 'Miami', 'Street #4')
;

And let's assume that in the original query, you are getting users with US addresses. In this case, the first query to get unique entity IDs will be:

mysql> SELECT DISTINCT c0_.id as id_0
    -> FROM cms_users c0_
    -> INNER JOIN cms_addresses c1_ ON c1_.user_id = c0_.id
    -> WHERE c1_.country = 'US';
+------+
| id_0 |
+------+
|    1 |
|    3 |
+------+
2 rows in set (0.00 sec)

And without this PR changes the second query to fetch actual users data will be:

SELECT
    c0_.id AS id_0,
    c0_.name AS name_1
FROM cms_users c0_
INNER JOIN cms_addresses c1_ ON c1_.user_id = c0_.id
WHERE 
    c1_.country = 'US'
    AND c0_.id IN (1, 3);

But with this PR changes the second query to fetch actual users data will be:

SELECT
    c0_.id AS id_0,
    c0_.name AS name_1
FROM cms_users c0_
WHERE c0_.id IN (1, 3);

The difference between these two queries is:

mysql> EXPLAIN SELECT
    ->     c0_.id AS id_0,
    ->     c0_.name AS name_1
    -> FROM cms_users c0_
    -> INNER JOIN cms_addresses c1_ ON c1_.user_id = c0_.id
    -> WHERE 
    ->     c1_.country = 'US'
    ->     AND c0_.id IN (1, 3);
+----+-------------+-------+------------+-------+---------------+---------+---------+-------------+------+----------+-------------+
| id | select_type | table | partitions | type  | possible_keys | key     | key_len | ref         | rows | filtered | Extra       |
+----+-------------+-------+------------+-------+---------------+---------+---------+-------------+------+----------+-------------+
|  1 | SIMPLE      | c0_   | NULL       | range | PRIMARY       | PRIMARY | 8       | NULL        |    2 |   100.00 | Using where |
|  1 | SIMPLE      | c1_   | NULL       | ref   | FK_user       | FK_user | 8       | test.c0_.id |    1 |    25.00 | Using where |
+----+-------------+-------+------------+-------+---------------+---------+---------+-------------+------+----------+-------------+
2 rows in set, 1 warning (0.00 sec)

VS

mysql> EXPLAIN SELECT
    ->     c0_.id AS id_0,
    ->     c0_.name AS name_1
    -> FROM cms_users c0_
    -> WHERE c0_.id IN (1, 3);
+----+-------------+-------+------------+-------+---------------+---------+---------+------+------+----------+-------------+
| id | select_type | table | partitions | type  | possible_keys | key     | key_len | ref  | rows | filtered | Extra       |
+----+-------------+-------+------------+-------+---------------+---------+---------+------+------+----------+-------------+
|  1 | SIMPLE      | c0_   | NULL       | range | PRIMARY       | PRIMARY | 8       | NULL |    2 |   100.00 | Using where |
+----+-------------+-------+------------+-------+---------------+---------+---------+------+------+----------+-------------+
1 row in set, 1 warning (0.00 sec)

This will become more noticeable the more joins/conditions/rows are used in the query.

Regarding introducing additional complexity with this PR, I tried to cover just the most used and basic cases to make this as simple as possible. Also, all the changes in the unit tests should give a good perspective on the context here.

Please let me know if there are any other questions you would like to discuss.

Copy link
Contributor

There hasn't been any activity on this pull request in the past 90 days, so it has been marked as stale and it will be closed automatically if no further activity occurs in the next 7 days.
If you want to continue working on it, please leave a comment.

@github-actions github-actions bot added the Stale label Oct 19, 2024
@derrabus
Copy link
Member

@SerheyDolgushev Looks like this PR did not get much traction, which is probably due to its complexity and the nieche problem that it solves. We don't introduce this kind of improvement to ORM 2 anymore, so the PR would need to be reworked to ORM 3. Do you want to do that or shall we close the PR?

@github-actions github-actions bot removed the Stale label Oct 20, 2024
@SerheyDolgushev SerheyDolgushev changed the base branch from 2.17.x to 3.3.x October 20, 2024 09:27
@SerheyDolgushev
Copy link
Contributor Author

@derrabus to be honest I had to go through the PR description to refresh my memories about it. And seems like it is still useful change and brings value. So I merged it with 3.3.x and updated the target branch. Please let me know if there is anything else I can help with.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants