Cache functionality #875

Merged · 51 commits merged into master from cache_functionality on Nov 23, 2022

Conversation

@MehmedGIT (Contributor)

No description provided.

@kba (Member) left a comment

Just a quick read-thru for now, I'll dig into the code further.

[6 inline review threads on ocrd_models/ocrd_models/ocrd_mets.py, all resolved]
@kba (Member) commented May 31, 2022

Also, why have separate `_fileGrp_cache` and `_file_cache` - doesn't the latter also contain the fileGrp info in a fast-to-access way? What would be more beneficial is caching the pageId of a file, because that requires a bit of a convoluted XPath to look up on-the-fly.
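For illustration, a minimal sketch of the two approaches (hypothetical helper names, not the actual OcrdMets code): resolving a file's pageId on the fly takes an XPath query over the physical structMap, while a cache reduces it to a single dict lookup.

```python
# Hypothetical sketch, not the actual OcrdMets implementation.
from lxml import etree

NS = {'mets': 'http://www.loc.gov/METS/'}

def page_id_uncached(tree, file_id):
    # the "convoluted XPath": find the page div whose fptr references the file,
    # then read that div's ID
    divs = tree.xpath(
        '//mets:structMap[@TYPE="PHYSICAL"]//mets:div[mets:fptr[@FILEID="%s"]]'
        % file_id, namespaces=NS)
    return divs[0].get('ID') if divs else None

def build_page_id_cache(tree):
    # cached variant: one pass over the structMap, after which each pageId
    # lookup is just cache[file_id]
    cache = {}
    for div in tree.xpath(
            '//mets:structMap[@TYPE="PHYSICAL"]//mets:div[@TYPE="page"]',
            namespaces=NS):
        for fptr in div.findall('mets:fptr', NS):
            cache[fptr.get('FILEID')] = div.get('ID')
    return cache

# usage: tree = etree.parse('mets.xml'); cache = build_page_id_cache(tree)
```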

@MehmedGIT (Contributor, Author)

> Also, why have separate `_fileGrp_cache` and `_file_cache` - doesn't the latter also contain the fileGrp info in a fast-to-access way? What would be more beneficial is caching the pageId of a file, because that requires a bit of a convoluted XPath to look up on-the-fly.

I think this should be explanatory enough:

```python
# Cache for the fileGrps (mets:fileGrp) - a dictionary with Key and Value pair:
# Key: 'fileGrp.USE'
# Value: a 'fileGrp' object at some memory location
self._fileGrp_cache = {}

# Cache for the files (mets:file) - two nested dictionaries
# The outer dictionary's Key: 'fileGrp.USE'
# The outer dictionary's Value: the inner dictionary
# The inner dictionary's Key: 'file.ID'
# The inner dictionary's Value: a 'file' object at some memory location
self._file_cache = {}
```

The `_fileGrp_cache` holds a reference to the `fileGrp` element itself, i.e. its memory location.
The `_file_cache` holds just the USE string of the `fileGrp` (as the outer key).

On second thought, maybe we do not benefit that much from holding `fileGrp` elements in a separate cache. We can just search the tree when we need a specific element.

@kba (Member) commented May 31, 2022

> The `_fileGrp_cache` holds a reference to the `fileGrp` element itself, i.e. its memory location. The `_file_cache` holds just the USE string of the `fileGrp` (as the outer key).

I get that, but the main/only use for querying about fileGrp is: does this fileGrp exist (a), and what files are in it (b)? These can be answered with the `_file_cache`, by checking `needle in self._file_cache` (a) and `for el_file in self._file_cache[needle]:` (b). Since the fileGrp element itself contains no other information in the METS that I'm aware of, it seems redundant to keep it around.
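Expressed against the nested cache layout quoted above, the two queries are one-liners (a sketch, not the actual implementation):

```python
# Sketch only: file_cache maps fileGrp USE -> {file ID -> file element}.
def filegrp_exists(file_cache, needle):
    # (a) does this fileGrp exist?
    return needle in file_cache

def iter_filegrp_files(file_cache, needle):
    # (b) which files are in it?
    yield from file_cache[needle].values()
```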

@MehmedGIT (Contributor, Author) commented Jun 7, 2022

[Screenshot from 2022-06-07 17-01-51]

Building the METS file with 50 files inside improved from 8.7 s down to 508 ms! Here we go, less electricity consumption for everyone :)

@bertsky (Collaborator) commented Jun 22, 2022

Related issue: #723

Related discussion: OCR-D/zenhub#39

Still missing IIUC:

I would suggest diversifying test scenarios BTW:
* thousands of pages
* dozens of fileGrps
* fileGrps with hundreds of files per page (think ocrd-cis-ocropy-dewarp for line level image files or ocrd-cis-ocropy-deskew / ocrd-tesserocr-deskew for region level files)

And a switch to enable/disable caching – ideally somewhere central, but we don't have a general-purpose configuration mechanism yet (except for logging), cf. options.

@MehmedGIT (Contributor, Author)

> Related issue: #723
>
> Related discussion: OCR-D/zenhub#39
>
> Still missing IIUC:
>
> I would suggest diversifying test scenarios BTW:
>
> • thousands of pages
> • dozens of fileGrps
> • fileGrps with hundreds of files per page (think ocrd-cis-ocropy-dewarp for line level image files or ocrd-cis-ocropy-deskew / ocrd-tesserocr-deskew for region level files)
>
> And a switch to enable/disable caching – ideally somewhere central, but we don't have a general-purpose configuration mechanism yet (except for logging), cf. options.

Could you provide more specific combinations for the desired scenarios? For example:
Scenario one: 5 mets:fileGrp, 10 mets:file per mets:fileGrp, 1000 physical pages with 10 mets:fptr each.
etc.

The many possible combinations confuse me, especially since I do not know which combinations the different OCR-D processors can actually produce.

The current benchmark scenarios are based on:

  • 8 file groups
  • 7 group regions
  • 10 regions per group region (70 regions in total)
  • 2 fileId per fileGrp
  • 14 + 70 = 84 files per page

So, for 20/50/100 pages we have respectively 1680/4200/8400 fptr.

@bertsky (Collaborator) commented Jun 23, 2022

No, sorry, but I have no advice on specific numbers for such combinations. Your choices seem to be some "small to medium average" scenario. What I called for was adding more extreme scenarios – not necessarily under assets and not necessarily all possible combinations.

@MehmedGIT (Contributor, Author) commented Jun 28, 2022

[Screenshot from 2022-06-28 06-39-30]

@bertsky, here are the results for 50-500-1000-2000-5000 pages. I forced iterations to 1 because the 5000-page (non-cached) run was already taking 3 days to finish. I have also decreased the number of regions from 10 to 2 per page. With the current settings, I got all results in just over 24 hours. I included 50 pages as well for comparison with the previous 50-page results (10 regions vs 2 regions per page).

I have also created 4 groups of tests (build, build-cached, search, search-cached) instead of the previous 2. Since the times in the table are given in ms and s, I will convert some of them to hours in my comment for easier comparison.

Building (non-cached) vs Building (cached):
50 pages: 1.140 (s) vs 95.00 (ms)
500 pages: 133.3 (s) (or 2.222 m) vs 5.800 (s)
1000 pages: 700.7 (s) (or 11.68 m) vs 26.87 (s)
2000 pages: 3 562 (s) (or 59.37 m) vs 145.7 (s) (or 2.428 m)
5000 pages: 27 059 (s) (or 7.516 h) vs 1 449 (s) (or 24.15 m)

Where: ms = milliseconds, s = seconds, m = minutes, h = hours

The search methods look up a fileGrp, a file ID, and a physical page. For each, both the best-case (first hit) and worst-case (not present in the cache/element tree) search parameters are tried.

Searching (non-cached) vs Searching (cached):
50 pages: 4.833 (ms) vs 2.181 (ms)
500 pages: 71.88 (ms) vs 20.88 (ms)
1000 pages: 199.5 (ms) vs 50.47 (ms)
2000 pages: 509.0 (ms) vs 106.6 (ms)
5000 pages: 1528 (ms) (or 1.528 s) vs 284.7 (ms)
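For context, a sketch of how such grouped timings can be collected with pytest-benchmark; the fixture file and the exact lookup calls here are assumptions, not the actual test code:

```python
import pytest
from ocrd_models import OcrdMets

METS_50_PAGES = open('mets_50_pages.xml').read()  # hypothetical benchmark asset

@pytest.mark.benchmark(group="search")
def test_search(benchmark):
    mets = OcrdMets(content=METS_50_PAGES)
    # worst case: the ID does not exist anywhere in the element tree
    benchmark(lambda: mets.find_all_files(ID='ID_DOES_NOT_EXIST'))

@pytest.mark.benchmark(group="search-cached")
def test_search_cached(benchmark):
    mets = OcrdMets(content=METS_50_PAGES, cache_flag=True)  # cache_flag per this PR
    benchmark(lambda: mets.find_all_files(ID='ID_DOES_NOT_EXIST'))
```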

Issue: OCR-D/zenhub#7

@bertsky (Collaborator) commented Aug 22, 2022

Thanks @MehmedGIT, very good!

Perhaps you could include your local results as a Markdown table in ocrd_models/README.md or a new file under tests/model/README.md?

> Still missing IIUC:
>
> • fileGrps with hundreds of files per page (think ocrd-cis-ocropy-dewarp for line level image files or ocrd-cis-ocropy-deskew / ocrd-tesserocr-deskew for region level files)

(in particular, hundreds of files per page combined with hundreds or even thousands of pages – this is realistic with dewarping in the workflow)

> And a switch to enable/disable caching – ideally somewhere central, but we don't have a general-purpose configuration mechanism yet (except for logging), cf. options.

@kba your thoughts on this aspect?

@MehmedGIT (Contributor, Author) commented Sep 29, 2022

> Perhaps you could include your local results as a Markdown table in ocrd_models/README.md or a new file under tests/model/README.md?

@bertsky, yes, I will do that once I have the additional extreme results (below).

> (in particular, hundreds of files per page combined with hundreds or even thousands of pages – this is realistic with dewarping in the workflow)

I have now started benchmarking with 750 files per page and 5000 pages, both cached and non-cached.
Please be patient about the results, since the non-cached version will probably take a lot of processing time.
Just for comparison: the results above for 5000 pages (30 files per page) took 7.5 hours.

@MehmedGIT (Contributor, Author)

@bertsky,

I have canceled the test execution, because even building the METS file for the non-cached version had not finished after almost 5 days. Since I could not use my work laptop for anything else efficiently due to the test's 100% CPU usage, I had to cancel it. I have pushed the test case for 750 files per page and 5000 pages, in case someone wants to experiment further with different numbers of files and pages.

[Screenshot from 2022-10-04 12-16-52]

@bertsky (Collaborator) commented Oct 4, 2022

@MehmedGIT understood. (Perhaps a server machine or SSH build on CircleCI would help with your resource limitation.)

It would help at least knowing how much that test took with caching enabled, though...

@MehmedGIT (Contributor, Author)

> @MehmedGIT understood. (Perhaps a server machine or SSH build on CircleCI would help with your resource limitation.)
>
> It would help at least knowing how much that test took with caching enabled, though...

@bertsky, I will execute the tests again on a cloud VM provided by my company. I was also wondering whether 750 files per page is more than we need. Do you think 300-400 files per page is still good enough? Is there a theoretical maximum for files per page?

Another thing I could do is increase the files per page but reduce the number of pages, say 1500 files per page but "only" 2000 pages? 2000 pages is still roughly 5 times the average page count from the statistics I have.

[Chart: page-count statistics ("Seiten")]

@bertsky (Collaborator) commented Oct 7, 2022

> Do you think 300-400 files per page is still good enough?

Yes.

> Is there a theoretical maximum for files per page?

There's nothing I can think of. Practically, we might see newspaper pages with 1000 lines or so. Also, for the typegroups classification case, there's even a use case for word-level image files. But these are rather rare cases, and we already get the picture by running a more conservative high file-count case. Especially if your statistics allow breaking down page-count scaling behaviour.

> Another thing I could do is increase the files per page but reduce the number of pages, say 1500 files per page but "only" 2000 pages? 2000 pages is still roughly 5 times the average page count from the statistics I have.

Yes. By the same logic, I would recommend reducing the page number to under 1000.

@kba (Member) commented Nov 20, 2022

I have now cleaned up the test setup and merged #955 into this; I think this is ready to merge.

  • The benchmarks are not run as part of `make test`, but with `make benchmark` or `make benchmark-extreme`. For the CI we run `make test benchmarks`.
  • The test_ocrd_mets.py will now test both uncached and cached OcrdMets (see the sketch below).
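A rough sketch of how that double coverage can be parametrized (fixture name, asset path and test body are illustrative assumptions, not the actual test file):

```python
import pytest
from ocrd_models import OcrdMets

@pytest.fixture(params=[False, True], ids=['uncached', 'cached'])
def sample_mets(request):
    # every test using this fixture runs twice, once per caching mode
    return OcrdMets(filename='tests/assets/kant_aufklaerung_1784/data/mets.xml',
                    cache_flag=request.param)

def test_unique_identifier(sample_mets):
    assert sample_mets.unique_identifier
```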

What is still missing for release?

@kba marked this pull request as ready for review on November 20, 2022, 15:11
@MehmedGIT (Contributor, Author) left a comment

@kba, here are my answers. Sorry, since I implemented the code some months ago, I don't fully remember what I had in mind for some of my comments. Once this is merged into core, I will have a more detailed overview again to check whether something could be further simplified/improved.

Yeah, only the CLI flags are missing.

[10 inline review threads on ocrd_models/ocrd_models/ocrd_mets.py, all resolved]
@kba (Member) commented Nov 22, 2022

It's finished now, but instead of CLI flags we've opted for an environment variable OCRD_METS_CACHING, which must be set to true to enable METS caching globally.

We decided against CLI flags because that would require a lot of changes in all the places in the code that work with METS, especially passing the `cache_flag` value through function calls until the METS constructor is called.

We need to tackle a proper, unified configuration option soon and then align the caching behavior with it.
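A minimal sketch of how such an environment-variable switch can be wired (illustrative, not the verbatim core code):

```python
import os

def caching_enabled():
    # OCRD_METS_CACHING must be set to "true" to enable caching globally
    return os.environ.get('OCRD_METS_CACHING', '').lower() == 'true'

class OcrdMets:  # illustrative stand-in for the real class
    def __init__(self, cache_flag=False, **kwargs):
        # the env var overrides the per-instance flag, so callers do not
        # have to thread cache_flag through every function call
        self._cache_flag = cache_flag or caching_enabled()
```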

@bertsky (Collaborator) commented Nov 22, 2022

Great to hear this is finally ready. My understanding was that we merge #678 first and see how this affects RSS (or PSS) throughout your test scenarios. Do you have any measurements like that already?

(And since you did decide in favour of environment variables here, should that affect the discussion about profiling settings as well?)

@kba (Member) commented Nov 23, 2022

> Great to hear this is finally ready. My understanding was that we merge #678 first and see how this affects RSS (or PSS) throughout your test scenarios. Do you have any measurements like that already?

@MehmedGIT Do we?

> (And since you did decide in favour of environment variables here, should that affect the discussion about profiling settings as well?)

For now, I would use environment variables like about:flags in browsers: we document the behavior but require an extra step (setting the env var). For the long term, I would now prefer a configuration file with env var overrides. We should discuss that in the tech calls and/or our workshop in December.

@kba merged commit 2de3329 into master on Nov 23, 2022
@kba deleted the cache_functionality branch on November 23, 2022, 10:03
@MehmedGIT (Contributor, Author)

> Great to hear this is finally ready. My understanding was that we merge #678 first and see how this affects RSS (or PSS) throughout your test scenarios. Do you have any measurements like that already?

No, we don't. I was waiting for #678 to be merged before testing the memory consumption.
However, there are Nextflow reports for the cached and non-cached executions of the same workflow on the same workspace here. As you know from my previous presentation of these results, interpreting the memory usage values from Nextflow reports was rather difficult.

@MehmedGIT (Contributor, Author)

@bertsky, @kba I would also refer again to the summary in #946. From the table there, we can see the comparisons, and that memory usage does not increase in all cases when caching is used.
