ocrd workspace malfunctioning #1148

MehmedGIT · 2023-12-11T15:15:53Z

To reproduce the issue, use the following METS file and download the MAX file group:

#!/bin/bash

export OCRD_DOWNLOAD_TIMEOUT=15
export OCRD_DOWNLOAD_RETRIES=3

WORKSPACE_DIR="urn_nbn_de_bsz_14-db-id3272770845"
OCRD_IDENTIFIER="urn nbn de bsz 14-db-id3272770845"
ORIGINAL_METS="$OCRD_IDENTIFIER.xml"

cp "$WORKSPACE_DIR/$ORIGINAL_METS" "$WORKSPACE_DIR/mets.xml"
ocrd workspace -d "$WORKSPACE_DIR" -m mets.xml find -q MAX --download
# ocrd zip bag -d "$WORKSPACE_DIR" -m mets.xml -i "$OCRD_IDENTIFIER" -q MAX --processes 4
# rm "$WORKSPACE_DIR/mets.xml"
# rm -rf "$WORKSPACE_DIR/MAX"

Output:

(venv38-operandi) mm@MM-Notebook:~/repos/ocrd_benchmarking/VD17$ ./build_workspace_zip.sh 
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None

The images of the MAX file group are downloaded successfully, despite the None output. Then:

(venv38-operandi) mm@MM-Notebook:~/repos/ocrd_benchmarking/VD17/urn_nbn_de_bsz_14-db-id3272770845$ ocrd workspace list-page -D 4 -C 2
Traceback (most recent call last):
  File "/home/mm/venv38-operandi/bin/ocrd", line 8, in <module>
    sys.exit(cli())
  File "/home/mm/venv38-operandi/lib/python3.8/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/mm/venv38-operandi/lib/python3.8/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/mm/venv38-operandi/lib/python3.8/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/mm/venv38-operandi/lib/python3.8/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/mm/venv38-operandi/lib/python3.8/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/mm/venv38-operandi/lib/python3.8/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/mm/venv38-operandi/lib/python3.8/site-packages/click/decorators.py", line 92, in new_func
    return ctx.invoke(f, obj, *args, **kwargs)
  File "/home/mm/venv38-operandi/lib/python3.8/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/mm/venv38-operandi/lib/python3.8/site-packages/ocrd/cli/workspace.py", line 613, in list_pages
    ids = sorted({x.pageId for x in workspace.mets.find_files(**find_kwargs)})
TypeError: '<' not supported between instances of 'NoneType' and 'lxml.etree._ElementUnicodeResult'

For some reason, None values end up in the set passed to sorted(), which leads to that error. The issue seems specific to workspaces whose page_id values are hashes.
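
The failure can be reproduced in isolation. This is a minimal sketch, assuming only that the set of page IDs holds one None next to ordinary strings; the IDs are made up and no ocrd code is involved:

```python
# Hypothetical page IDs; None stands in for a file with no page assigned.
page_ids = {"PHYS_0001", "PHYS_0002", None}

error_message = None
try:
    sorted(page_ids)
except TypeError as err:
    # Python 3 refuses to order None against a string.
    error_message = str(err)

print(error_message)
```

In the traceback above the string type is lxml.etree._ElementUnicodeResult rather than plain str, but the comparison fails for the same reason.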

kba · 2023-12-12T13:13:27Z

The None values printed to STDOUT are unrelated. By default, ocrd workspace find outputs just the local_filename (this can be overridden with -k), which is None in this case because the files are not locally available.
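
To illustrate why a missing local_filename shows up as the literal text None, here is a small sketch. FileRecord and render are hypothetical stand-ins written for this example, not the ocrd API, and the URL is a placeholder:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FileRecord:
    """Hypothetical stand-in for a mets:file entry; not the ocrd API."""
    file_id: str
    url: str
    local_filename: Optional[str] = None


def render(record: FileRecord, field: str = "local_filename") -> str:
    # Formatting an unset field yields the literal text "None".
    return str(getattr(record, field))


remote_only = FileRecord(file_id="MAX_0001", url="https://example.org/0001.jpg")
default_line = render(remote_only)           # field not set yet
url_line = render(remote_only, field="url")  # a different output field
print(default_line)
print(url_line)
```

Choosing another output field, as -k does on the command line, avoids the None lines for files that only exist remotely.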

The reason for the problem here is that there is a mets:file with @ID="FULLDOWNLOAD" which does not belong to any specific page. Which makes sense because it is a PDF consisting of all the pages.

I think I'll work around this by just ignoring files unrelated to pages.
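
A workaround of this kind could look roughly like the following. This is only a sketch under the assumption that files without a page carry None as their page ID; sorted_page_ids is a hypothetical helper, not the code that was eventually merged:

```python
from typing import Iterable, List, Optional


def sorted_page_ids(page_ids: Iterable[Optional[str]]) -> List[str]:
    """Return unique page IDs in sorted order, skipping files without a page."""
    return sorted({pid for pid in page_ids if pid is not None})


# Hypothetical input: the last entry mimics a document-wide file such as FULLDOWNLOAD.
ids = sorted_page_ids(["PHYS_0002", "PHYS_0001", "PHYS_0002", None])
print(ids)
```

Filtering before building the set keeps the sort free of mixed types, so files like FULLDOWNLOAD no longer affect the page listing.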

MehmedGIT · 2023-12-12T13:34:35Z

> The reason for the problem here is that there is a mets:file with @ID="FULLDOWNLOAD" which does not belong to any specific page. Which makes sense because it is a PDF consisting of all the pages.

I see, and that makes sense. Why are such files not problematic when running an OCR-D processor over the workspace? How do processors handle such files?

kba · 2023-12-13T16:42:48Z

> Why are such files not problematic when running an OCR-D processor over the workspace? How do processors handle such files?

Such files are not problematic, I just didn't account for them in the list-page algorithm.

We don't have any processors that take document-global files as input, but there is no fundamental reason why non-page-specific files would be a problem. For example, if you add a bunch of files to a file group GRP of a workspace without specifying -g/--page-id, you can call a processor with -I GRP and it will do the right thing, but there will be no mapping of files to pages.
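
To see which files in a METS have no mapping to a page, one can compare the file section against the physical structMap. The sketch below uses only the Python standard library on a made-up miniature METS; the file group name DOWNLOAD and all IDs except FULLDOWNLOAD are assumptions for illustration:

```python
import xml.etree.ElementTree as ET

NS = {"mets": "http://www.loc.gov/METS/"}

# Made-up miniature METS: two page images plus one document-wide PDF.
METS_XML = """<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:fileSec>
    <mets:fileGrp USE="MAX">
      <mets:file ID="MAX_0001" MIMETYPE="image/jpeg"/>
      <mets:file ID="MAX_0002" MIMETYPE="image/jpeg"/>
    </mets:fileGrp>
    <mets:fileGrp USE="DOWNLOAD">
      <mets:file ID="FULLDOWNLOAD" MIMETYPE="application/pdf"/>
    </mets:fileGrp>
  </mets:fileSec>
  <mets:structMap TYPE="PHYSICAL">
    <mets:div TYPE="physSequence">
      <mets:div TYPE="page" ID="PHYS_0001"><mets:fptr FILEID="MAX_0001"/></mets:div>
      <mets:div TYPE="page" ID="PHYS_0002"><mets:fptr FILEID="MAX_0002"/></mets:div>
    </mets:div>
  </mets:structMap>
</mets:mets>"""


def files_without_page(mets_xml: str) -> list:
    """Return IDs of mets:file entries that no page div points to."""
    root = ET.fromstring(mets_xml)
    # File IDs referenced from page-level divs of the physical structMap.
    paged = {
        fptr.get("FILEID")
        for fptr in root.findall(
            ".//mets:structMap[@TYPE='PHYSICAL']//mets:div[@TYPE='page']/mets:fptr", NS
        )
    }
    all_ids = [f.get("ID") for f in root.findall(".//mets:fileSec//mets:file", NS)]
    return [file_id for file_id in all_ids if file_id not in paged]


orphans = files_without_page(METS_XML)
print(orphans)
```

Files reported this way can still be processed through their file group; they simply cannot be selected by page.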

We do have ocrd_pagetopdf which produces such files, though.

bertsky · 2023-12-13T21:20:08Z

We do have various processors which handle document-wide files:

import or export of MS-COCO segmentation
multi-page PDF or TEI output
multi-page PDF or TIFF input (I know, forbidden by spec, but I would try not to be so strict)
evaluation reports
logical document structure (as global file) output

There's even an old spec issue (OCR-D/spec#142) about that.

kba · 2023-12-14T12:03:50Z

I just wanted to emphasize that document-wide files are not a problem per se. We have processors that produce them, and you can process arbitrary files, which might be document-wide, as long as they are in a file group. You just cannot use the --page-id mechanism for those files.

> multi-page PDF or TIFF input (I know, forbidden by spec, but I would try not to be so strict)

We do disallow multi-page TIFF files, but that was mostly to avoid making the processing loop too complex by having to support both single- and multi-page images.

I do see the benefit of supporting PDF as input; that is a very common use case. I see no reason why we could not have an ocrd-split-pdf processor that does the splitting, then have a regular workflow based on those extracted images, followed by ocrd_pagetopdf. If there's anything in the spec to prevent that, we need to change it.

> There's even an old spec issue OCR-D/spec#142 about that.

Indeed, sorry this has been open for so long. I'll answer over there.

bertsky · 2023-12-14T12:06:29Z

> I see no reason why we could not have an ocrd-split-pdf processor that does the splitting, then have a regular workflow based on those extracted images

Note that ocrd-import, though not a processor, already does that.

kba mentioned this issue Dec 12, 2023

ocrd workpace list-page: ignore files without pageId, fix #1148 #1151

Merged

kba closed this as completed in c7e4b91 Dec 14, 2023

bertsky mentioned this issue Mar 21, 2024

workspace find --download prints None for each file #1202

Closed
