Cache functionality #875
Conversation
Just a quick read-through for now, I'll dig into the code further.
Also, why have separate |
I think this should be self-explanatory enough:
Having second thoughts about it: maybe we do not benefit that much from holding fileGrp elements in a separate cache. We can just search the tree when we need a specific element. |
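(For illustration only, a minimal sketch — not the actual ocrd_models code — contrasting the two approaches discussed here: a fileGrp cache keyed by USE versus searching the element tree on demand. The `MetsWrapper` class and its method names are made up for this sketch.)

```python
from lxml import etree

NS = {'mets': 'http://www.loc.gov/METS/'}

class MetsWrapper:
    def __init__(self, tree):
        self.tree = tree
        # Cached variant: build a dict keyed by USE once; it must be kept
        # in sync whenever fileGrp elements are added or removed.
        self._filegrp_cache = {
            el.get('USE'): el
            for el in tree.iterfind('.//mets:fileGrp', namespaces=NS)
        }

    def filegrp_cached(self, use):
        # O(1) dict lookup
        return self._filegrp_cache.get(use)

    def filegrp_searched(self, use):
        # Uncached variant: walk the tree on every query.
        return self.tree.find(f".//mets:fileGrp[@USE='{use}']", namespaces=NS)

mets = MetsWrapper(etree.fromstring(
    b'<mets:mets xmlns:mets="http://www.loc.gov/METS/">'
    b'<mets:fileSec><mets:fileGrp USE="OCR-D-IMG"/></mets:fileSec>'
    b'</mets:mets>'))
assert mets.filegrp_cached('OCR-D-IMG') is not None
assert mets.filegrp_searched('OCR-D-IMG') is not None
```

The trade-off is the usual one: the dict gives constant-time lookups but has to be invalidated on every mutation, while the tree search is always correct but linear in the number of elements.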
I get that, but the main/only use for querying about |
Related issue: #723
Related discussion: OCR-D/zenhub#39

Still missing IIUC:
And a switch to enable/disable caching – ideally somewhere central, but we don't have a general-purpose configuration mechanism yet (except for logging), cf. options. |
Could you provide more specific combinations for the desired scenarios? I am getting confused by the different possible combinations, especially since I do not know which combinations can be created by the different OCR-D processors. For example, the current benchmark scenarios are based on:
So, for 20/50/100 pages we have 1680/4200/8400 fptr respectively (i.e. 84 fptr per page). |
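(A minimal sketch of generating such a synthetic scenario, assuming an OcrdMets-like `add_file(fileGrp, ID=..., pageId=...)` call — illustrative, not the exact core signature:)

```python
def build_synthetic_mets(mets, pages=20, files_per_page=84):
    """Populate `mets` with pages * files_per_page fptr entries.

    84 files per page reproduces the totals quoted above:
    84 * 20 = 1680, 84 * 50 = 4200, 84 * 100 = 8400.
    """
    for page in range(pages):
        page_id = f'PHYS_{page:04d}'
        for n in range(files_per_page):
            # add_file(fileGrp, ID=..., pageId=...) is assumed here;
            # check the actual OcrdMets API before reusing this sketch.
            mets.add_file('OCR-D-IMG',
                          ID=f'FILE_{page:04d}_{n:03d}',
                          pageId=page_id)
```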
No, sorry, but I have no advice on specific numbers for such combinations. Your choices seem to be some "small to medium average" scenario. What I called for was adding more extreme scenarios – not necessarily under assets and not necessarily all possible combinations. |
@bertsky, here are the results for 50/500/1000/2000/5000 pages. I forced iterations to 1 because the non-cached run for 5000 pages alone was already taking 3 days to finish. I have also decreased the number of regions per page from 10 to 2. With the current settings, I got all results in just a little over 24 hours. I included 50 pages as well for comparison with the previous 50-page results (10 regions vs 2 regions per page). I have also created 4 groups of tests (build, build-cached, search, search-cached) instead of the previous 2. Since the times in the table are given in ms and s, I will convert some of them to hours in my comment for easier comparison.

Building (non-cached) vs Building (cached):

Where: ms = milliseconds, s = seconds, m = minutes, h = hours

The searching methods search by fileGrp, fileID, and physical page. For each, both the best-case (first hit) and worst-case (no hit in the cache/element tree) search parameters are tried.

Searching (non-cached) vs Searching (cached):

Issue: OCR-D/zenhub#7 |
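(To make the best-/worst-case split concrete, here is a hedged pytest-benchmark sketch; the `mets` fixture is assumed to come from the test setup, and `find_files(ID=...)` is modeled on ocrd_models rather than copied from it. The IDs are made up.)

```python
# Requires pytest-benchmark (provides the `benchmark` fixture); the `mets`
# fixture is assumed to be built by the test setup.

def test_search_first_hit(benchmark, mets):
    # Best case: an ID that is found immediately.
    benchmark(lambda: list(mets.find_files(ID='FILE_0000_000')))

def test_search_miss(benchmark, mets):
    # Worst case: an ID present neither in the cache nor in the element tree.
    benchmark(lambda: list(mets.find_files(ID='FILE_DOES_NOT_EXIST')))
```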
Thanks @MehmedGIT, very good! Perhaps you could include your local results as a Markdown table into
(in particular, hundreds of files per page combined with hundreds or even thousands of pages – this is realistic with dewarping in the workflow)
@kba your thoughts on this aspect? |
@bertsky, yes, I will do that once I have the additional extreme results (below).
I have now started benchmarking with 750 files per page and 5000 pages – both cached and non-cached. |
I have canceled the test execution: even building the METS file for the non-cached version had not finished after almost 5 days, and since the test's 100% CPU usage kept me from using my work laptop for anything else, I had to stop it. I have pushed the test case for 750 files per page and 5000 pages in case someone wants to experiment further with different numbers of files and pages. |
@MehmedGIT understood. (Perhaps a server machine or an SSH build on CircleCI would help with your resource limitation.) It would still help to know at least how long that test took with caching enabled, though... |
@bertsky, I will execute the tests again on a cloud VM provided by my company. I was also wondering whether 750 files per page is more than what we need. Do you think 300-400 files per page is still good enough? What is the theoretical maximum for files per page, if there is one? Another thing I could do is increase the files per page but reduce the number of pages, say 1500 files per page but "only" 2000 pages? 2000 pages is still roughly 5 times the average page count in the statistics I have. |
1. Yes.
2. There's nothing I can think of. Practically, we might see newspaper pages with 1000 lines or so. Also, for the typegroups classification case, there's even a use case for word-level image files. But these are rather rare cases, and we already get the picture by running a more conservative high-file-count case – especially if your statistics allow breaking down page-count scaling behaviour.
3. Yes. By the same logic, I would recommend reducing the page number to under 1000. |
I have now cleaned up the test setup and merged #955 into this; I think this is ready for merge.
What is still missing for release?
|
@kba, here are my answers. Sorry, since I implemented the code some months ago, I don't fully remember what I had in mind for some of my comments. Once this is merged into core, I will again have a more detailed overview and can check whether something could be further simplified or improved.
Yeah, only the CLI flags are missing.
It's finished now, but instead of CLI flags we've opted for an environment variable. We decided against CLI flags because that would require a lot of changes in all the places in the code that work with METS, especially passing the

We need to tackle a proper, unified configuration option soon and then align the caching behavior with it. |
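(A minimal sketch of such a switch — the variable name `OCRD_METS_CACHING` is a placeholder here, since the actual name is elided above:)

```python
import os

def caching_enabled():
    # OCRD_METS_CACHING is a placeholder; the actual variable name is
    # elided in the comment above, so check the merged core code.
    return os.environ.get('OCRD_METS_CACHING', '').lower() in ('1', 'true', 'yes')
```

A caller then exports the variable once instead of threading a flag through every METS-touching code path or CLI signature.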
Great to hear this is finally ready. My understanding was that we would merge #678 first and see how this affects RSS (or PSS) throughout your test scenarios. Do you have any measurements like that already? (And since you did decide in favour of environment variables here, should that affect the discussion about profiling settings as well?) |
@MehmedGIT Do we?
For now, I would use environment variables like |
No, we don't. I was waiting for #678 to be merged to test the memory consumption. |