Cache functionality #875

Merged · 51 commits merged into master from cache_functionality on Nov 23, 2022

Conversation

@MehmedGIT (Contributor)

No description provided.

@kba (Member) left a comment

Just a quick read-thru for now, I'll dig into the code further.

[6 inline review threads on ocrd_models/ocrd_models/ocrd_mets.py, all resolved]
@kba (Member) commented May 31, 2022

Also, why have separate `_fileGrp_cache` and `_file_cache` - doesn't the latter also contain the fileGrp info in a fast-to-access way? What would be more beneficial is caching the pageId of a file, because that requires a bit of a convoluted XPath to look up on-the-fly.
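For illustration, a minimal sketch of the two approaches (hypothetical helper names, not the actual OcrdMets code): resolving a file's pageId on the fly takes an XPath query over the physical structMap, while a cache reduces it to a single dict lookup.

```python
# Hypothetical sketch, not the actual OcrdMets implementation.
from lxml import etree

NS = {'mets': 'http://www.loc.gov/METS/'}

def page_id_uncached(tree, file_id):
    # the "convoluted XPath": find the page div whose fptr references the file,
    # then read that div's ID
    divs = tree.xpath(
        '//mets:structMap[@TYPE="PHYSICAL"]//mets:div[mets:fptr[@FILEID="%s"]]'
        % file_id, namespaces=NS)
    return divs[0].get('ID') if divs else None

def build_page_id_cache(tree):
    # cached variant: one pass over the structMap, after which each pageId
    # lookup is just cache[file_id]
    cache = {}
    for div in tree.xpath(
            '//mets:structMap[@TYPE="PHYSICAL"]//mets:div[@TYPE="page"]',
            namespaces=NS):
        for fptr in div.findall('mets:fptr', NS):
            cache[fptr.get('FILEID')] = div.get('ID')
    return cache

# usage: tree = etree.parse('mets.xml'); cache = build_page_id_cache(tree)
```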

@MehmedGIT (Contributor, Author)

> Also, why have separate `_fileGrp_cache` and `_file_cache` - doesn't the latter also contain the fileGrp info in a fast-to-access way? What would be more beneficial is caching the pageId of a file, because that requires a bit of a convoluted XPath to look up on-the-fly.

I think this should be explanatory enough:

```python
# Cache for the fileGrps (mets:fileGrp) - a dictionary with Key and Value pair:
# Key: 'fileGrp.USE'
# Value: a 'fileGrp' object at some memory location
self._fileGrp_cache = {}

# Cache for the files (mets:file) - two nested dictionaries
# The outer dictionary's Key: 'fileGrp.USE'
# The outer dictionary's Value: the inner dictionary
# The inner dictionary's Key: 'file.ID'
# The inner dictionary's Value: a 'file' object at some memory location
self._file_cache = {}
```

The `_fileGrp_cache` holds a reference to the `fileGrp` element itself, i.e. its memory location.
The `_file_cache` holds just the USE string of the `fileGrp` (as the outer key).

On second thought, maybe we do not benefit that much from holding `fileGrp` elements in a separate cache. We can just search the tree when we need a specific element.

@kba (Member) commented May 31, 2022

> The `_fileGrp_cache` holds a reference to the `fileGrp` element itself, i.e. its memory location. The `_file_cache` holds just the USE string of the `fileGrp` (as the outer key).

I get that, but the main/only use for querying about fileGrp is: does this fileGrp exist (a), and what files are in it (b)? These can be answered with the `_file_cache`, by checking `needle in self._file_cache` (a) and `for el_file in self._file_cache[needle]:` (b). Since the fileGrp element itself contains no other information in the METS that I'm aware of, it seems redundant to keep it around.
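Expressed against the nested cache layout quoted above, the two queries are one-liners (a sketch, not the actual implementation):

```python
# Sketch only: file_cache maps fileGrp USE -> {file ID -> file element}.
def filegrp_exists(file_cache, needle):
    # (a) does this fileGrp exist?
    return needle in file_cache

def iter_filegrp_files(file_cache, needle):
    # (b) which files are in it?
    yield from file_cache[needle].values()
```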

@MehmedGIT (Contributor, Author) commented Jun 7, 2022

[Screenshot from 2022-06-07 17-01-51]

Building the METS file with 50 files inside improved from 8.7 s down to 508 ms! Here we go, less electricity consumption for everyone :)

@bertsky (Collaborator) commented Jun 22, 2022

Related issue: #723

Related discussion: OCR-D/zenhub#39

Still missing IIUC:

I would suggest diversifying test scenarios BTW:
* thousands of pages
* dozens of fileGrps
* fileGrps with hundreds of files per page (think ocrd-cis-ocropy-dewarp for line level image files or ocrd-cis-ocropy-deskew / ocrd-tesserocr-deskew for region level files)

And a switch to enable/disable caching – ideally somewhere central, but we don't have a general-purpose configuration mechanism yet (except for logging), cf. options.

@MehmedGIT (Contributor, Author)

> Related issue: #723
>
> Related discussion: OCR-D/zenhub#39
>
> Still missing IIUC:
>
> I would suggest diversifying test scenarios BTW:
>
> • thousands of pages
> • dozens of fileGrps
> • fileGrps with hundreds of files per page (think ocrd-cis-ocropy-dewarp for line level image files or ocrd-cis-ocropy-deskew / ocrd-tesserocr-deskew for region level files)
>
> And a switch to enable/disable caching – ideally somewhere central, but we don't have a general-purpose configuration mechanism yet (except for logging), cf. options.

Could you provide more specific combinations for the desired scenarios? For example:
Scenario one: 5 mets:fileGrp, 10 mets:file per mets:fileGrp, 1000 physical pages with 10 mets:fptr each.
etc.

The many possible combinations confuse me, especially since I do not know which combinations the different OCR-D processors can actually produce.

The current benchmark scenarios are based on:

  • 8 file groups
  • 7 group regions
  • 10 regions per group region (70 regions in total)
  • 2 fileId per fileGrp
  • 14 + 70 = 84 files per page

So, for 20/50/100 pages we have respectively 1680/4200/8400 fptr.

@bertsky (Collaborator) commented Jun 23, 2022

No, sorry, but I have no advice on specific numbers for such combinations. Your choices seem to be some "small to medium average" scenario. What I called for was adding more extreme scenarios – not necessarily under assets and not necessarily all possible combinations.

@MehmedGIT (Contributor, Author) commented Jun 28, 2022

[Screenshot from 2022-06-28 06-39-30]

@bertsky, here are the results for 50-500-1000-2000-5000 pages. I forced iterations to 1 because the 5000-page (non-cached) run was already taking 3 days to finish. I have also decreased the number of regions from 10 to 2 per page. With the current settings, I got all results in just over 24 hours. I included 50 pages as well for comparison with the previous 50-page results (10 regions vs 2 regions per page).

I have also created 4 groups of tests (build, build-cached, search, search-cached) instead of the previous 2. Since the times in the table are given in ms and s, I will convert some of them to hours in my comment for easier comparison.

Building (non-cached) vs Building (cached):
50 pages: 1.140 (s) vs 95.00 (ms)
500 pages: 133.3 (s) (or 2.222 m) vs 5.800 (s)
1000 pages: 700.7 (s) (or 11.68 m) vs 26.87 (s)
2000 pages: 3 562 (s) (or 59.37 m) vs 145.7 (s) (or 2.428 m)
5000 pages: 27 059 (s) (or 7.516 h) vs 1 449 (s) (or 24.15 m)

Where: ms = milliseconds, s = seconds, m = minutes, h = hours

The search methods look up a fileGrp, a file ID, and a physical page. For each, both the best-case (first hit) and worst-case (not present in the cache/element tree) search parameters are tried.

Searching (non-cached) vs Searching (cached):
50 pages: 4.833 (ms) vs 2.181 (ms)
500 pages: 71.88 (ms) vs 20.88 (ms)
1000 pages: 199.5 (ms) vs 50.47 (ms)
2000 pages: 509.0 (ms) vs 106.6 (ms)
5000 pages: 1528 (ms) (or 1.528 s) vs 284.7 (ms)
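For context, a sketch of how such grouped timings can be collected with pytest-benchmark; the fixture file and the exact lookup calls here are assumptions, not the actual test code:

```python
import pytest
from ocrd_models import OcrdMets

METS_50_PAGES = open('mets_50_pages.xml').read()  # hypothetical benchmark asset

@pytest.mark.benchmark(group="search")
def test_search(benchmark):
    mets = OcrdMets(content=METS_50_PAGES)
    # worst case: the ID does not exist anywhere in the element tree
    benchmark(lambda: mets.find_all_files(ID='ID_DOES_NOT_EXIST'))

@pytest.mark.benchmark(group="search-cached")
def test_search_cached(benchmark):
    mets = OcrdMets(content=METS_50_PAGES, cache_flag=True)  # cache_flag per this PR
    benchmark(lambda: mets.find_all_files(ID='ID_DOES_NOT_EXIST'))
```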

Issue: OCR-D/zenhub#7

@bertsky (Collaborator) commented Aug 22, 2022

Thanks @MehmedGIT, very good!

Perhaps you could include your local results as a Markdown table in ocrd_models/README.md or a new file under tests/model/README.md?

> Still missing IIUC:
>
> • fileGrps with hundreds of files per page (think ocrd-cis-ocropy-dewarp for line level image files or ocrd-cis-ocropy-deskew / ocrd-tesserocr-deskew for region level files)

(in particular, hundreds of files per page combined with hundreds or even thousands of pages – this is realistic with dewarping in the workflow)

> And a switch to enable/disable caching – ideally somewhere central, but we don't have a general-purpose configuration mechanism yet (except for logging), cf. options.

@kba your thoughts on this aspect?

@MehmedGIT (Contributor, Author) commented Sep 29, 2022

> Perhaps you could include your local results as a Markdown table in ocrd_models/README.md or a new file under tests/model/README.md?

@bertsky, yes, I will do that once I have the additional extreme results (below).

> (in particular, hundreds of files per page combined with hundreds or even thousands of pages – this is realistic with dewarping in the workflow)

I have now started benchmarking with 750 files per page and 5000 pages, both cached and non-cached.
Please be patient about the results, since the non-cached version will probably take a lot of processing time.
Just for comparison: the results above for 5000 pages (30 files per page) took 7.5 hours.

@MehmedGIT (Contributor, Author)

@bertsky,

I have canceled the test execution, because even building the METS file for the non-cached version had not finished after almost 5 days. Since I could not use my work laptop for anything else efficiently due to the test's 100% CPU usage, I had to cancel it. I have pushed the test case for 750 files per page and 5000 pages, in case someone wants to experiment further with different numbers of files and pages.

[Screenshot from 2022-10-04 12-16-52]

@bertsky (Collaborator) commented Oct 4, 2022

@MehmedGIT understood. (Perhaps a server machine or SSH build on CircleCI would help with your resource limitation.)

It would help at least knowing how much that test took with caching enabled, though...

@MehmedGIT (Contributor, Author)

> @MehmedGIT understood. (Perhaps a server machine or SSH build on CircleCI would help with your resource limitation.)
>
> It would help at least knowing how much that test took with caching enabled, though...

@bertsky, I will execute the tests again on a cloud VM provided by my company. I was also wondering whether 750 files per page is more than we need. Do you think 300-400 files per page is still good enough? Is there a theoretical maximum for files per page?

Another thing I could do is increase the files per page but reduce the number of pages, say 1500 files per page but "only" 2000 pages? 2000 pages is still roughly 5 times the average page count from the statistics I have.

[Chart: page-count statistics ("Seiten")]

@bertsky (Collaborator) commented Oct 7, 2022

> Do you think 300-400 files per page is still good enough?

Yes.

> Is there a theoretical maximum for files per page?

There's nothing I can think of. Practically, we might see newspaper pages with 1000 lines or so. Also, for the typegroups classification case, there's even a use case for word-level image files. But these are rather rare cases, and we already get the picture by running a more conservative high file-count case. Especially if your statistics allow breaking down page-count scaling behaviour.

> Another thing I could do is increase the files per page but reduce the number of pages, say 1500 files per page but "only" 2000 pages? 2000 pages is still roughly 5 times the average page count from the statistics I have.

Yes. By the same logic, I would recommend reducing the page number to under 1000.

@kba (Member) commented Nov 20, 2022

I have now cleaned up the test setup and merged #955 into this; I think this is ready to merge.

  • The benchmarks are not run as part of `make test`, but with `make benchmark` or `make benchmark-extreme`. For the CI we run `make test benchmarks`.
  • The test_ocrd_mets.py will now test both uncached and cached OcrdMets (see the sketch below).
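A rough sketch of how that double coverage can be parametrized (fixture name, asset path and test body are illustrative assumptions, not the actual test file):

```python
import pytest
from ocrd_models import OcrdMets

@pytest.fixture(params=[False, True], ids=['uncached', 'cached'])
def sample_mets(request):
    # every test using this fixture runs twice, once per caching mode
    return OcrdMets(filename='tests/assets/kant_aufklaerung_1784/data/mets.xml',
                    cache_flag=request.param)

def test_unique_identifier(sample_mets):
    assert sample_mets.unique_identifier
```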

What is still missing for release?

@kba marked this pull request as ready for review on November 20, 2022, 15:11
@MehmedGIT (Contributor, Author) left a comment

@kba, here are my answers. Sorry, since I implemented the code some months ago, I don't fully remember what I had in mind for some of my comments. Once this is merged into core, I will have a more detailed overview again to check whether something could be further simplified/improved.

Yeah, only the CLI flags are missing.

[10 inline review threads on ocrd_models/ocrd_models/ocrd_mets.py, all resolved]
@kba (Member) commented Nov 22, 2022

It's finished now, but instead of CLI flags we've opted for an environment variable OCRD_METS_CACHING, which must be set to true to enable METS caching globally.

We decided against CLI flags because that would require a lot of changes in all the places in the code that work with METS, especially passing the `cache_flag` value through function calls until the METS constructor is called.

We need to tackle a proper, unified configuration option soon and then align the caching behavior with it.
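A minimal sketch of how such an environment-variable switch can be wired (illustrative, not the verbatim core code):

```python
import os

def caching_enabled():
    # OCRD_METS_CACHING must be set to "true" to enable caching globally
    return os.environ.get('OCRD_METS_CACHING', '').lower() == 'true'

class OcrdMets:  # illustrative stand-in for the real class
    def __init__(self, cache_flag=False, **kwargs):
        # the env var overrides the per-instance flag, so callers do not
        # have to thread cache_flag through every function call
        self._cache_flag = cache_flag or caching_enabled()
```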

@bertsky (Collaborator) commented Nov 22, 2022

Great to hear this is finally ready. My understanding was that we merge #678 first and see how this affects RSS (or PSS) throughout your test scenarios. Do you have any measurements like that already?

(And since you did decide in favour of environment variables here, should that affect the discussion about profiling settings as well?)

@kba (Member) commented Nov 23, 2022

> Great to hear this is finally ready. My understanding was that we merge #678 first and see how this affects RSS (or PSS) throughout your test scenarios. Do you have any measurements like that already?

@MehmedGIT Do we?

> (And since you did decide in favour of environment variables here, should that affect the discussion about profiling settings as well?)

For now, I would use environment variables like about:flags in browsers: we document the behavior but require an extra step (setting the env var). For the long term, I would now prefer a configuration file with env var overrides. We should discuss that in the tech calls and/or our workshop in December.

@kba merged commit 2de3329 into master on Nov 23, 2022
@kba deleted the cache_functionality branch on November 23, 2022, 10:03
@MehmedGIT (Contributor, Author)

> Great to hear this is finally ready. My understanding was that we merge #678 first and see how this affects RSS (or PSS) throughout your test scenarios. Do you have any measurements like that already?

No, we don't. I was waiting for #678 to be merged before testing the memory consumption.
However, there are Nextflow reports for the cached and non-cached executions of the same workflow on the same workspace here. As you know from my previous presentation of these results, interpreting the memory usage values from Nextflow reports was rather difficult.

@MehmedGIT (Contributor, Author)

@bertsky, @kba I would also refer again to the summary in #946. From the table there, we can see the comparisons, and that memory usage does not increase in all cases when caching is used.
