ocrd workspace malfunctioning #1148

Closed
MehmedGIT opened this issue Dec 11, 2023 · 6 comments

@MehmedGIT
Contributor

MehmedGIT commented Dec 11, 2023

To reproduce the issue, use the following METS file and download the MAX file group.

#!/bin/bash

export OCRD_DOWNLOAD_TIMEOUT=15
export OCRD_DOWNLOAD_RETRIES=3

WORKSPACE_DIR="urn_nbn_de_bsz_14-db-id3272770845"
OCRD_IDENTIFIER="urn nbn de bsz 14-db-id3272770845"
ORIGINAL_METS="$OCRD_IDENTIFIER.xml"

cp "$WORKSPACE_DIR/$ORIGINAL_METS" "$WORKSPACE_DIR/mets.xml"
ocrd workspace -d "$WORKSPACE_DIR" -m mets.xml find -q MAX --download
# ocrd zip bag -d "$WORKSPACE_DIR" -m mets.xml -i "$OCRD_IDENTIFIER" -q MAX --processes 4
# rm "$WORKSPACE_DIR/mets.xml"
# rm -rf "$WORKSPACE_DIR/MAX"

Output:

(venv38-operandi) mm@MM-Notebook:~/repos/ocrd_benchmarking/VD17$ ./build_workspace_zip.sh 
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None

The images of the MAX file group are downloaded successfully, despite the None output. Then:

(venv38-operandi) mm@MM-Notebook:~/repos/ocrd_benchmarking/VD17/urn_nbn_de_bsz_14-db-id3272770845$ ocrd workspace list-page -D 4 -C 2
Traceback (most recent call last):
  File "/home/mm/venv38-operandi/bin/ocrd", line 8, in <module>
    sys.exit(cli())
  File "/home/mm/venv38-operandi/lib/python3.8/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/mm/venv38-operandi/lib/python3.8/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/mm/venv38-operandi/lib/python3.8/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/mm/venv38-operandi/lib/python3.8/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/mm/venv38-operandi/lib/python3.8/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/mm/venv38-operandi/lib/python3.8/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/mm/venv38-operandi/lib/python3.8/site-packages/click/decorators.py", line 92, in new_func
    return ctx.invoke(f, obj, *args, **kwargs)
  File "/home/mm/venv38-operandi/lib/python3.8/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/mm/venv38-operandi/lib/python3.8/site-packages/ocrd/cli/workspace.py", line 613, in list_pages
    ids = sorted({x.pageId for x in workspace.mets.find_files(**find_kwargs)})
TypeError: '<' not supported between instances of 'NoneType' and 'lxml.etree._ElementUnicodeResult'

For some reason, None values are passed to sorted(), which leads to that error. The issue seems specific to workspaces that have hashes for page_id.
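The TypeError can be reproduced without a workspace. The following is a minimal sketch, assuming the set built in list_pages contains ordinary page ID strings plus one None; the page IDs and the helper name try_sort are made up for illustration.

```python
# Hypothetical page IDs; None stands for a file that is not linked to any page.
page_ids = {"PHYS_0001", "PHYS_0002", None}


def try_sort(values):
    """Return the sorted list, or the TypeError message if the values cannot be compared."""
    try:
        return sorted(values)
    except TypeError as exc:
        return str(exc)


result = try_sort(page_ids)
print(result)
```

Python 3 refuses to order None against a string, so a single None in the set is enough to make sorted() fail.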

@kba
Member

kba commented Dec 12, 2023

The None values printed to STDOUT are unrelated. By default, ocrd workspace find outputs just the local_filename (can be overridden with -k), which is None in this case because the files are not locally available.
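To illustrate why a missing local_filename shows up as the literal text None, here is a rough sketch of field-based output. It is not the actual ocrd implementation: the function format_find_output, the dictionary layout, the URLs, and the tab separator are all assumptions.

```python
def format_find_output(files, keys=("local_filename",)):
    """Return one line per file containing the requested fields, tab-separated."""
    return ["\t".join(str(f.get(key)) for key in keys) for f in files]


# Hypothetical file entries that have a URL but no local copy yet.
files = [
    {"ID": "MAX_0001", "url": "https://example.org/0001.jpg", "local_filename": None},
    {"ID": "MAX_0002", "url": "https://example.org/0002.jpg", "local_filename": None},
]

default_lines = format_find_output(files)
url_lines = format_find_output(files, keys=("ID", "url"))
print("\n".join(default_lines))
print("\n".join(url_lines))
```

Selecting other fields, as -k does on the command line, gives meaningful output even when no local file exists.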

The reason for the problem here is that there is a mets:file with @ID="FULLDOWNLOAD" which does not belong to any specific page. Which makes sense because it is a PDF consisting of all the pages.

I think I'll work around this by just ignoring files unrelated to pages.
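A minimal sketch of such a workaround, assuming the fix simply filters out files whose page ID is None before sorting. FileEntry and list_page_ids are hypothetical stand-ins, not the OCR-D classes or the actual patch.

```python
from typing import Iterable, List, NamedTuple, Optional


class FileEntry(NamedTuple):
    """Hypothetical stand-in for a mets:file entry; not the OCR-D class."""
    file_id: str
    page_id: Optional[str]


def list_page_ids(files: Iterable[FileEntry]) -> List[str]:
    """Collect the distinct page IDs, skipping files that belong to no page."""
    return sorted({f.page_id for f in files if f.page_id is not None})


files = [
    FileEntry("MAX_0002", "PHYS_0002"),
    FileEntry("MAX_0001", "PHYS_0001"),
    FileEntry("FULLDOWNLOAD", None),  # document-wide PDF, linked to no page
]
print(list_page_ids(files))  # ['PHYS_0001', 'PHYS_0002']
```

With the page-less file filtered out, only strings reach sorted() and the comparison error cannot occur.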

@MehmedGIT
Contributor Author

The reason for the problem here is that there is a mets:file with @ID="FULLDOWNLOAD" which does not belong to any specific page. Which makes sense because it is a PDF consisting of all the pages.

I see, and it makes sense. Why are such files not problematic when running an OCR-D processor over the workspace? How do processors handle such files?

@kba
Member

kba commented Dec 13, 2023

Why are such files not problematic when running an ocr-d processor over the workspace? How do processors handle such files?

Such files are not problematic, I just didn't account for them in the list-page algorithm.

We don't have any processors that take document-global files as input, but there is no fundamental reason why non-page-specific files would be a problem. For example, if you add a bunch of files to a file group GRP of a workspace without specifying -g/--page-id, you can call a processor with -I GRP and it will do the right thing, but there will be no mapping of files to pages.
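The distinction between page-specific and page-less files can be sketched as follows. This only models the idea with plain tuples; the function names and all IDs are hypothetical, and it does not reflect how OCR-D stores the mapping internally.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple

# (file ID, page ID) pairs; a page ID of None means the file maps to no page.
FilePair = Tuple[str, Optional[str]]


def map_files_to_pages(files: Iterable[FilePair]) -> Tuple[Dict[str, List[str]], List[str]]:
    """Split files into a page -> file IDs mapping and a list of page-less file IDs."""
    by_page: Dict[str, List[str]] = defaultdict(list)
    without_page: List[str] = []
    for file_id, page_id in files:
        if page_id is None:
            without_page.append(file_id)
        else:
            by_page[page_id].append(file_id)
    return dict(by_page), without_page


def select_by_page(files: Iterable[FilePair], wanted_pages: Iterable[str]) -> List[str]:
    """Mimic a page-based selection: only files mapped to a wanted page match."""
    wanted = set(wanted_pages)
    return [file_id for file_id, page_id in files if page_id in wanted]


grp = [("GRP_0001", "PHYS_0001"), ("GRP_0002", "PHYS_0002"), ("GRP_REPORT", None)]
by_page, without_page = map_files_to_pages(grp)
print(by_page)       # {'PHYS_0001': ['GRP_0001'], 'PHYS_0002': ['GRP_0002']}
print(without_page)  # ['GRP_REPORT']
print(select_by_page(grp, ["PHYS_0001"]))  # ['GRP_0001']
```

The page-less file stays part of the group, so anything that iterates over the whole group still sees it, while a page-based selection can never reach it.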

We do have ocrd_pagetopdf which produces such files, though.

@bertsky
Collaborator

bertsky commented Dec 13, 2023

We do have various processors which handle document-wide files:

  • import or export of MS-COCO segmentation
  • multi-page PDF or TEI output
  • multi-page PDF or TIFF input (I know, forbidden by spec, but I would try not to be so strict)
  • evaluation reports
  • logical document structure (as global file) output

There's even an old spec issue about that.

@kba
Member

kba commented Dec 14, 2023

I just wanted to emphasize that document-wide files are not a problem per se: we have processors that produce them, and you can process arbitrary files, which might be document-wide, as long as they are in a file group. You just cannot use the --page-id mechanism for those files.

multi-page PDF or TIFF input (I know, forbidden by spec, but I would try not to be so strict)

We do disallow multi-page TIFF files but that was mostly to avoid making the process loop too complex, having to support both single- and multi-page images.

I do see the benefit of supporting PDF as input; that is a very common use case. I see no reason why we could not have an ocrd-split-pdf processor that does the splitting, then have a regular workflow based on those extracted images, followed by ocrd_pagetopdf. If there's anything in the spec to prevent that, we need to change it.

There's even an old spec issue OCR-D/spec#142 about that.

Indeed, sorry this has been open for so long. I'll answer over there.

@bertsky
Collaborator

bertsky commented Dec 14, 2023

I see no reason why we could not have an ocrd-split-pdf processor that does the splitting, then have a regular workflow based on those extracted images

Note that ocrd-import, though not a processor, already does that.
