Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance issues with advanced search queries #2040

Closed
ivyrze opened this issue Mar 23, 2023 · 13 comments
Closed

Performance issues with advanced search queries #2040

ivyrze opened this issue Mar 23, 2023 · 13 comments

Comments

@ivyrze
Copy link

ivyrze commented Mar 23, 2023

Running an advanced search query with more than 3 OR clauses takes over 75s to complete even with dedicated, high-performance hardware. Additional clauses increase the query time exponentially. I've tested on small installations with few items and populated metadata fields, but it doesn't make much difference. We're trying use this to search for a handful of metadata keywords, which I'd imagine is a relatively common use case.

@zerocrates
Copy link
Member

Hmm this certainly isn't generally the case with all installs: just as a data point on a local install on a laptop with ~30,000 items, the time of a 4-clause OR "contains" search isn't anywhere near 75 seconds here.

The version of Omeka S and of MySQL or MariaDB you're seeing this on would be relevant, probably. "contains" searches are never going to be particularly efficient by their nature, but if we're talking a minute-plus, that sounds like some sort of pathological mis-optimization going on somewhere.

@ivyrze
Copy link
Author

ivyrze commented Mar 24, 2023

I'm using Omeka S 4.0.1 with no modules or themes, 200 items, and MySQL 8.0.32 on an M1 Mac. I also tried MariaDB 10.11.2 but it didn't make any difference in performance times. I have been testing using the "is exactly" query, which is of course an improvement over wildcard searches, but it's still awfully sluggish. Similarly, specifying the metadata term seems to help too but only to a degree.

@zerocrates
Copy link
Member

4.0.2 makes a small change to indexing for the resources table which could make some relevant difference here.

Daniel-KM pushed a commit to Daniel-KM/Omeka-S that referenced this issue Sep 11, 2023
Daniel-KM pushed a commit to Daniel-KM/Omeka-S that referenced this issue Sep 11, 2023
@ivyrze
Copy link
Author

ivyrze commented Oct 9, 2023

I compared 4.0.1 against the latest stable version as well as #2096 using MariaDB 11.1.2, here are my results:

Omeka S version Search query Query time
4.0.1 [Any property] contains, OR x4 10s
4.0.4 14s
4.1.0-alpha6 PR 10s
4.0.1 [Any property] contains, OR x5 4m42s
4.0.4 3m49s
4.1.0-alpha6 PR 3m14s*
4.0.1 [Any property] is exactly, OR x5 2m6s
4.0.4 2m5s
4.1.0-alpha6 PR 2m6s
4.0.1 Description contains, OR x5 <1s
4.0.4 <1s
4.1.0-alpha6 PR <1s

*Kept crashing due to segmentation fault

I also noticed that the database doesn't respond to any new queries while the search is being run, so this could be used to very easily carry out a denial-of-service attack.

@zerocrates
Copy link
Member

Thanks for the research. Can I ask, were these queries performed on the admin or public sides (or rather, as a logged-in or logged-out user, more relevantly)? Since you mention the recent PR, I'm wondering about my comment there that the inclusion of is_public as the first component of the key could cause the index to be not used when is_public isn't part of the query (i.e., when the query is performed by a user that has permission to see all the items).

We had trouble before reproducing your performance results... I wonder if you could possibly share a dump of your database so we could try to investigate possible changes and/or compare results to yours?

@ivyrze
Copy link
Author

ivyrze commented Oct 10, 2023

These queries were performed from the admin panel, but we originally came across the issue while logged out on front-end, so my initial inclination is that it's no-factor, but I'll look into it more.

To populate items on my local machine for these tests, I used the OSII module to pull a duplicate of https://arepr.org/api using all default settings.

Let me know if there's anything I can do to lend a hand.

@zerocrates
Copy link
Member

OK. Specifically the logged-in/logged-out aspect should only really strongly come into play for the PR with the new index, since that index includes is_public.

@ivyrze
Copy link
Author

ivyrze commented Oct 10, 2023

Okay got it. I just ran a quick test using the PR while logged out and got similar results to above.

@zerocrates
Copy link
Member

Can you try out the branch consecutive-or-optimize?

@ivyrze
Copy link
Author

ivyrze commented Oct 17, 2023

Just from initial tinkering, this looks great – huge improvement. I'll test it out more rigorously and report back.

@ivyrze
Copy link
Author

ivyrze commented Oct 23, 2023

I ran some more tests and everything works well, within reasonable limits. Thanks again for working on this, I support it being merged

@zerocrates
Copy link
Member

Thanks, this is very helpful feedback.

I'll update when we've moved on this... probably a merge of that branch, though we'll probably squash it or otherwise clean it up a little before then.

@zerocrates
Copy link
Member

OK, it's merged (d013400).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants