performance of input_files on large workspaces #723

Open
bertsky opened this issue Sep 28, 2021 · 9 comments
bertsky commented Sep 28, 2021

By implementing #635 to properly handle all cases of PAGE-XML file matching per pageId, we lost sight of the severe performance penalty this comes with. In effect, we are now nearly as slow as we were before #482 on workspaces with many pages and fileGrps.

Here's a typical scenario:

  1. ocrd-cis-ocropy-dewarp creates an image file for each text line and references it under the pageId and fileGrp it belongs to, with an image MIME type (for me this creates 29,000 files in a workspace with 500 pages)
  2. ocrd-tesserocr-recognize runs afterwards and queries its self.input_files:

    for file_ in sorted(self.workspace.mets.find_all_files(
            pageId=self.page_id, fileGrp=ifg, mimetype=mimetype),
            # sort by MIME type so PAGE comes before images
            key=lambda file_: file_.mimetype):
        if not file_.pageId:
            continue
  3. this searches through all mets:file entries, matching them for fileGrp (which is reasonably fast; it only gets a little inefficient when additionally filtering by pageId):

    for cand in self._tree.getroot().xpath('//mets:file', namespaces=NS):
        if ID:
            if ID.startswith(REGEX_PREFIX):
                if not fullmatch(ID[REGEX_PREFIX_LEN:], cand.get('ID')): continue
            else:
                if not ID == cand.get('ID'): continue
        # (pageId has already been expanded to the set of matching FILEIDs here)
        if pageId is not None and cand.get('ID') not in pageId:
            continue
        if fileGrp:
            if fileGrp.startswith(REGEX_PREFIX):
                if not fullmatch(fileGrp[REGEX_PREFIX_LEN:], cand.getparent().get('USE')): continue
            else:
                if cand.getparent().get('USE') != fileGrp: continue
        if mimetype:
            if mimetype.startswith(REGEX_PREFIX):
                if not fullmatch(mimetype[REGEX_PREFIX_LEN:], cand.get('MIMETYPE') or ''): continue
            else:
                if cand.get('MIMETYPE') != mimetype: continue
        if url:
            cand_locat = cand.find('mets:FLocat', namespaces=NS)
            if cand_locat is None:
                continue
            cand_url = cand_locat.get('{%s}href' % NS['xlink'])
            if url.startswith(REGEX_PREFIX):
                if not fullmatch(url[REGEX_PREFIX_LEN:], cand_url): continue
            else:
                if cand_url != url: continue
        f = OcrdFile(cand, mets=self)
  4. then (in line 298, and again further below) it queries OcrdFile.pageId:

    def pageId(self):
        """
        Get the ``@ID`` of the physical ``mets:structMap`` entry corresponding to this ``mets:file`` (physical page manifestation).
        """
        if self.mets is None:
            raise Exception("OcrdFile %s has no member 'mets' pointing to parent OcrdMets" % self)
        return self.mets.get_physical_page_for_file(self)
  5. This in turn needs to repeatedly query the whole structMap via XPath (which on a workspace with 500 pages, 25 fileGrps and 200,000 files takes about 0.2 s per file, i.e. more than 1 h just for the computation of input_files):

    def get_physical_page_for_file(self, ocrd_file):
        """
        Get the physical page ID (``@ID`` of the physical ``mets:structMap`` ``mets:div`` entry)
        corresponding to the ``mets:file`` :py:attr:`ocrd_file`.
        """
        ret = self._tree.getroot().xpath(
            '/mets:mets/mets:structMap[@TYPE="PHYSICAL"]/mets:div[@TYPE="physSequence"]'
            '/mets:div[@TYPE="page"][./mets:fptr[@FILEID="%s"]]/@ID' % ocrd_file.ID,
            namespaces=NS)

A little cosmetics like turning OcrdFile.pageId into a functools.cached_property won't help here; the problem is bigger. METS, with its mutually related fileGrp and pageId mappings, is inherently expensive to parse. I know we have decided against in-memory representations like dicts in the past, because they looked like memory leaks or seemed too expensive on very large workspaces. But have we really weighed that memory vs. CPU-time tradeoff carefully yet (also considering the necessity of pageId/mimetype filtering)? Is there any existing code attempting to cache the fileGrp and pageId mappings, to avoid reparsing the METS again and again, which I could tamper with?
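
To illustrate the kind of cache I have in mind, here is a minimal sketch (the function name is made up; tree and NS are the parsed METS and the namespace map that OcrdMets already holds). One XPath pass over the physical structMap builds the dict, after which every OcrdFile.pageId lookup becomes a plain dict access:

def build_fileid_to_pageid(tree):
    # single pass over the physical structMap: FILEID -> page @ID
    fileid_to_pageid = {}
    for page in tree.getroot().xpath(
            '/mets:mets/mets:structMap[@TYPE="PHYSICAL"]'
            '/mets:div[@TYPE="physSequence"]/mets:div[@TYPE="page"]',
            namespaces=NS):
        for fptr in page.findall('mets:fptr', namespaces=NS):
            fileid_to_pageid[fptr.get('FILEID')] = page.get('ID')
    return fileid_to_pageid

With that, get_physical_page_for_file degenerates to fileid_to_pageid.get(ocrd_file.ID), i.e. the structMap scan is paid once per METS instead of once per file.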

kba commented Sep 28, 2021

5. This in turn needs to repeatedly query the whole structMap via XPath (which on a workspace with 500 pages, 25 fileGrps and 200,000 files takes about 0.2 s per file, i.e. more than 1 h just for the computation of input_files):

This is obviously bad and needs fixing :(

Is there any existing code attempting to cache fileGrp and pageId mappings to avoid reparsing the METS again and again, which I could tamper with?

I did work on an OcrdMetsFilter class for more fine-grained control when cloning workspaces, but it was supposed to be used wherever METS is to be queried, cf. #582. One plan I had was to cache pageId in there, to take the penalty of parsing everything just once. I had to abandon the PR for other business, and am not even sure how much effort it would be to merge master into it. But maybe it can serve as a starting point for a discussion on how to improve the query performance.

bertsky commented Sep 29, 2021

I did work on an OcrdMetsFilter class for more fine-grained control for cloning workspaces but it was supposed to be used wherever METS is to be queried, cf. #582. One plan I had was to cache pageId in there to take the penalty of parsing everything just once.

Okay, so this is an elaborate filter for find_files, allowing for negative tests as well (but from before we had .. ranges for pageId). The function itself has the same structure, though. I just don't see where the pageId caching should happen. But it could use some cached pageId2fileId mapping (which would speed up any non-first call to OcrdMets.find_files). With the above Processor.input_files loop, however, it is the other direction we need, fileId2pageId (which would speed up any non-first call to OcrdFile.pageId and reduce the complexity of Processor.input_files).

I think that we should try to parse mets:fileSec and mets:structMap only once (into fileId/pageId and fileId/fileGrp dicts which we can reverse if needed): during Workspace's or OcrdMets's constructor (which is run in Resolver.workspace_from_url in processor tasks and in ocrd.cli.workspace).
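
A sketch of what that single-pass parse could look like (the attribute names are made up, not existing OcrdMets API; build_fileid_to_pageid is the hypothetical helper sketched above):

from collections import defaultdict

def _build_caches(self):
    # one pass over the fileSec: fileId -> fileGrp @USE
    self._fileid_to_filegrp = {}
    for grp in self._tree.getroot().xpath('//mets:fileGrp', namespaces=NS):
        for file_ in grp.findall('mets:file', namespaces=NS):
            self._fileid_to_filegrp[file_.get('ID')] = grp.get('USE')
    # one pass over the physical structMap (cf. the sketch above)
    self._fileid_to_pageid = build_fileid_to_pageid(self._tree)
    # reverse mapping, for the pageId filter in find_files
    self._pageid_to_fileids = defaultdict(list)
    for file_id, page_id in self._fileid_to_pageid.items():
        self._pageid_to_fileids[page_id].append(file_id)

The flip side is that every mutation (add_file, remove_file, changing a page assignment) would then have to keep these dicts in sync with the tree.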

bertsky commented Apr 30, 2022

Just to add another use case to the scenario: even a simple OcrdMets.add_file can become inefficient on large workspaces (as slow as 1 op/s), because it first looks for existing files with the same ID:

mets_file = next(self.find_files(ID=ID), None)

So the proposed caching should also happen here.
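
For instance (hypothetical names, assuming the caches sketched above additionally keep a fileId -> mets:file element dict):

def _file_by_id(self, ID):
    # O(1) replacement for next(self.find_files(ID=ID), None)
    el = self._fileid_to_elem.get(ID)
    return OcrdFile(el, mets=self) if el is not None else None

add_file could then check mets_file = self._file_by_id(ID) without scanning the whole fileSec.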

bertsky commented Apr 30, 2022

@MehmedGIT what's the status of your OcrdMets profiling experiment?

MehmedGIT commented

@bertsky I have pushed my latest changes to the benchmarking branch, but have not worked on that experiment since. @mweidling is investigating this topic in more depth, and I am available for discussions and support if needed. My personal opinion is that we should try to optimize the OcrdMets functionality as soon as possible.

mweidling commented

Just to add another use case to the scenario: even a simple OcrdMets.add_file can become inefficient on large workspaces (as slow as 1 op/s), because it first looks for existing files with the same ID:

mets_file = next(self.find_files(ID=ID), None)

So the proposed caching should also happen here.

Thanks for pointing that out, @bertsky!

bertsky commented May 2, 2022

@MehmedGIT I don't understand: the benchmark-mets branch does not seem to contain any actual changes to the modules, only additional tests in benchmarks (i.e. outside the normal test suite). Also, under test_mets2.py it contains an ExtendedOcrdMets class with some caching.

Could it be that you actually need to incorporate these changes into ocrd_models.ocrd_mets.OcrdMets proper, so that one can see the diff and test in situ?

MehmedGIT commented

@bertsky you are right, I have not changed the actual modules. In order to implement my own version of the functions, I extended the OcrdMets class. I did that because I did not know how to compile just the OcrdMets class alone 😅 and found a fast solution that way. Moreover, when I implemented the ExtendedOcrdMets class, it was more of a proof of concept, to see whether the functions can be optimized and to get some comparison results to trigger further discussion. The changes in the benchmarks branch are not a proper implementation ready to be merged.
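
Roughly, the shape of that proof of concept (a sketch, not the actual code from the benchmarks branch; it reuses the hypothetical build_fileid_to_pageid helper from above):

from ocrd_models.ocrd_mets import OcrdMets

class ExtendedOcrdMets(OcrdMets):
    """OcrdMets with a cached fileId -> pageId mapping, leaving upstream untouched."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self._fileid_to_pageid = build_fileid_to_pageid(self._tree)

    def get_physical_page_for_file(self, ocrd_file):
        # dict lookup instead of one structMap XPath query per call
        return self._fileid_to_pageid.get(ocrd_file.ID)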

bertsky commented May 2, 2022

@MehmedGIT I see, thanks.
