
Reduce processing overhead for large amount of imagery data #524

Open · ptpt opened this issue Aug 9, 2022 · 11 comments

@ptpt (Member) commented Aug 9, 2022

Although processing images (reading geotags) is fast, for large amounts of data it is still slow. We should avoid reprocessing images that have already been uploaded or processed. See https://forum.mapillary.com/t/mapillary-tools-0-8-no-status-kept/6197/16

ptpt changed the title from "Improve processing overhead for large amount of imagery data" to "Reduce processing overhead for large amount of imagery data" on Aug 9, 2022
@richlv (Contributor) commented Sep 19, 2022

Heya, any luck with this one?
I ended up reprocessing and EXIF-testing a lot of data today, and would have benefited a lot from the suggested functionality :)

@ptpt (Member, Author) commented Sep 20, 2022

@richlv maybe this isn't an issue if you process your images once and then keep uploading.

Instead of repeatedly processing and uploading:

mapillary_tools process_and_upload path/to/images

use this workflow:

# process once (no need to process again unless you have new images added)
mapillary_tools process path/to/images

# then run upload (can be run multiple times due to interruptions)
mapillary_tools upload path/to/images

@richlv (Contributor) commented Jan 19, 2023

> maybe this isn't an issue if you process your images once and then keep uploading

Thank you for the workaround - it might work, but it adds extra complexity to the workflow.
It already takes some effort to capture and upload the images; it would be great if the tools made this easier :)

Just as the upload status is tracked, could the processing status be tracked too?

@ptpt (Member, Author) commented Jan 20, 2023

It's a tradeoff. Processing status could be tracked for the process step in process_and_upload, but it adds complexity: we can't write status files into the processed folders because they could be read-only, and users get annoyed by these temporary files. Otherwise we would need a centralized database that tracks when each folder was last changed (mtime), and that is hard to do right: https://apenwarr.ca/log/20181113
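For illustration only, here is a minimal sketch of that kind of centralized tracking (not a mapillary_tools feature): a flat status file under TMPDIR keyed by directory path and mtime. It inherits the usual mtime caveats from the article above, e.g. a directory's mtime changes when files are added or removed, not when an existing file's contents change.

#!/bin/sh
# Minimal sketch, NOT part of mapillary_tools: skip `process` for a directory
# whose path and mtime are already recorded in a flat status file.
STATUS_FILE="${TMPDIR:-/tmp}/mapillary_process_status.txt"
touch "$STATUS_FILE"

for image_dir in "$@"; do
    [ -d "$image_dir" ] || continue
    # GNU stat first, BSD/macOS stat as a fallback
    mtime=$(stat -c %Y "$image_dir" 2>/dev/null || stat -f %m "$image_dir")
    if grep -qxF "$image_dir $mtime" "$STATUS_FILE"; then
        echo "unchanged, skipping process: $image_dir"
    else
        mapillary_tools process "$image_dir" && echo "$image_dir $mtime" >> "$STATUS_FILE"
    fi
    mapillary_tools upload "$image_dir"
done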

> Thank you for the workaround - it might work, but it adds extra complexity to the workflow.

Simply replace process_and_upload with upload. process_and_upload is an alias for process followed by upload. See: https://github.com/mapillary/mapillary_tools#process_and_upload

ptpt closed this as completed on Jan 20, 2023
@richlv (Contributor) commented Jan 23, 2023

Thank you so much for looking into this.
Can we please reopen this issue? Details on the reasoning follow.

As far as I understand, mapillary_tools already uses TMPDIR to store the archive files for upload.
Thus status could be stored in this same directory, or a new environment variable like MLY_STATUS_DIR could be used.

I'm not quite sure why mtime would need to be tracked - just recording which files have been processed seems sufficient. If I recall correctly, that is how the previous processing status tracking worked.
If mtime is desirable for cases where the file path and name are unchanged but the contents change, that seems like an exception that could simply be documented. Such an implementation would likely be sufficient for most, if not all, users and situations.
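For illustration, a rough sketch of the kind of tracking I have in mind (MLY_STATUS_DIR here is just the hypothetical variable suggested above, not an existing mapillary_tools option); it marks whole directories rather than individual images to keep the example short:

#!/bin/sh
# Rough sketch only, not a mapillary_tools feature: keep one marker file per
# successfully processed image directory under a status directory, and skip
# the process step when the marker already exists.
STATUS_DIR="${MLY_STATUS_DIR:-${TMPDIR:-/tmp}/mapillary_status}"
mkdir -p "$STATUS_DIR"

for image_dir in "$@"; do
    [ -d "$image_dir" ] || continue
    # derive a marker name from the directory path
    marker="$STATUS_DIR/$(echo "$image_dir" | tr '/ ' '__').processed"
    if [ -f "$marker" ]; then
        echo "already processed: $image_dir"
    else
        mapillary_tools process "$image_dir" && touch "$marker"
    fi
    mapillary_tools upload "$image_dir"
done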

Replacing process_and_upload with upload wouldn't work - that basically requires me to manually track processing status anyway, or script that logic myself. And it would be so much better for such an improvement to be available to all users :)

To clarify the need, here is a frequent use case, especially when travelling:

  • Collect many images with an action camera, up to 100 GB per day.
  • Move the images off the camera storage and split them into directories - up to 30-40 directories per day.
  • Overnight, try to upload as many of the images as possible.

To do so, I have a simple script that loops over these directories and runs process_and_upload.
Directory status varies a lot: some directories are processed; some are uploaded, but due to varying internet access quality they are not always uploaded sequentially (some might fail in the middle); some are uploaded only partially.
Usually only a portion of the directories gets uploaded on any given day, and new directories are added daily.

Thus the only thing the script really needs to do is loop over these directories, then process and upload.
But without processing status tracking in mapillary_tools, it's either a lot of re-processing, or I need to implement status tracking in my script. That is very tricky to do per image, and doing it per directory could record an incorrect status if processing is interrupted. Any such implementation outside of mapillary_tools would also be fragile, as tool changes could break it and cause data loss.

@ptpt (Member, Author) commented Jan 23, 2023

Thanks for the context. Would you mind sharing the relevant part of your script?

Here is a simple idea that avoids reprocessing by checking for each image directory's mapillary_image_description.json:

#!/bin/sh
set -e

# loop through all image directories
for image_dir in "$@"; do
    if [ -d "$image_dir" ]; then
        desc_path="${image_dir}/mapillary_image_description.json"
        # check if the desc file exists
        if [ -f "$desc_path" ]; then
            echo processed already "$image_dir"
        else
            mapillary_tools process "$image_dir" --desc_path="$desc_path"
        fi
        mapillary_tools upload "$image_dir" --desc_path="$desc_path"
    fi
done

Usage:

sh loop.sh ~/daily_captures/*

ptpt reopened this on Jan 23, 2023
@richlv (Contributor) commented Jan 23, 2023

The script, excluding the part that does some GPX trickery for edge cases, is trivial.
$1 is the path (passed in from a for loop on the command line); $gt is usually empty.

export TMPDIR='/somewhere/tmp'
...
echo "y" | mapillary_tools process_and_upload --user_name richlv --interpolate_directions "$1" $gt --skip_process_errors

Thank you for the script idea, greatly appreciated.

If checking for "mapillary_image_description.json" is a good approach, what would be the reason not to do that in mapillary_tools right away and skip processing when the file is found?
That seems simpler than users scripting their own hacks, and a status message could be printed when the file is found, making the behaviour very obvious.

@ptpt (Member, Author) commented Jan 23, 2023

> If checking for "mapillary_image_description.json" is a good approach, what would be the reason not to do that in mapillary_tools right away and skip processing when the file is found?

Think of mapillary_tools process image_dir as exiftool image_dir: its job is to extract metadata from the images in the folder and write it out. Having the tool silently skip processing based on the presence of this output file might be confusing to users.

I will keep the issue open until there is a good solution. For now, I would suggest scripting it:

image_dir="$1"
desc_path="${image_dir}/mapillary_image_description.json"
# check if the desc file exists
if ! [ -f "$desc_path" ]; then
    mapillary_tools process "$image_dir" $gt --desc_path="$desc_path" --interpolate_directions --skip_process_errors
fi
mapillary_tools upload "$image_dir" --desc_path="$desc_path" --user_name richlv

to replace

echo "y" | mapillary_tools process_and_upload --user_name richlv --interpolate_directions "$1" $gt --skip_process_errors

@richlv (Contributor) commented May 16, 2023

Have we observed specific scenarios where it would be confusing?
I'm wondering whether the current overhead isn't causing more confusion :)
A long time ago I added status checks in mapillary_tools and opened a pull request, but that was before a full rewrite or two, so I won't be able to produce anything merge-worthy now.

@ptpt (Member, Author) commented May 24, 2023

@richlv could you try out https://github.com/mapillary/mapillary_tools/releases/tag/v0.10.2? Image processing performance is improved significantly in this version (700 images per second on my MacBook Pro M1).

@richlv (Contributor) commented May 24, 2023

Thanks, that is indeed faster - greatly appreciated :)
It would still be great to automatically avoid re-processing. Just this week I restarted uploads several times with 10+ directories and 2 GB of images per directory.

Is the potential for confusion easy to avoid, if it is still a concern?
