Processor.input_files broke for pageId selector lists #622

Closed
bertsky opened this issue Oct 8, 2020 · 1 comment

bertsky commented Oct 8, 2020

There is a regression in 84a4e1a: when passing multiple pages for an image-only input fileGrp, e.g. -g phys_0001,phys_0007 -I OCR-D-IMG, the check that is meant to prevent mixing derived images with original images is now triggered falsely:

    ret = self.workspace.mets.find_all_files(
        fileGrp=self.input_file_grp, pageId=self.page_id, mimetype="//image/.*")
    if self.page_id and len(ret) > 1:
        raise ValueError("No PAGE-XML %s in fileGrp '%s' but multiple images." % (
            "for page '%s'" % self.page_id if self.page_id else '',
            self.input_file_grp
        ))
    return ret

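The problem is that self.page_id here can actually denote a list of pages (a single string in comma-joined notation), so len(ret) > 1 is expected whenever more than one page is selected, even if each page has exactly one image.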

So the correct way of ensuring that no single page gets multiple image file results is by

  • either disallowing find_all_files to aggregate them like this (which is probably valid in other contexts, though)
  • or going through its result ret and checking whether any of its pageIds repeat:
    page_ids = [file.pageId for file in ret]
    if len(page_ids) != len(set(page_ids)):
        # at least one page has more than one image file: raise the ValueError here
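
To make the second option concrete, here is a minimal, self-contained sketch of a per-page duplicate check. It is not the fix that was eventually merged; the names FakeFile, duplicate_page_ids and check_one_image_per_page are hypothetical and only stand in for the file objects returned by find_all_files (of which only the pageId attribute is assumed) and for the surrounding Processor logic.

```python
from collections import Counter, namedtuple

# Hypothetical stand-in for the file objects returned by find_all_files;
# only the pageId attribute is assumed here.
FakeFile = namedtuple("FakeFile", ["ID", "pageId"])


def duplicate_page_ids(files):
    """Return the sorted page IDs that occur more than once in `files`."""
    counts = Counter(f.pageId for f in files)
    return sorted(page_id for page_id, n in counts.items() if n > 1)


def check_one_image_per_page(files, file_grp):
    """Raise ValueError if any single page has more than one image file."""
    dupes = duplicate_page_ids(files)
    if dupes:
        raise ValueError(
            "No PAGE-XML in fileGrp '%s' but multiple images for page(s): %s"
            % (file_grp, ", ".join(dupes)))
    return files


# Two different pages with one image each: passes, unlike `len(ret) > 1`.
ok = [FakeFile("IMG_0001", "phys_0001"), FakeFile("IMG_0007", "phys_0007")]
check_one_image_per_page(ok, "OCR-D-IMG")
```

Unlike the plain set comparison, this variant can also report which pages are affected, at the cost of one extra pass with a Counter.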
@bertsky bertsky added the bug label Oct 8, 2020
@kba kba self-assigned this Oct 9, 2020

bertsky commented Nov 3, 2020

Fixed in #635

@bertsky bertsky closed this as completed Nov 3, 2020