Add option to only crawl website and not run warc2zim conversion #297

benoit74 · 2024-05-13T06:42:57Z

For debugging purposes, it might be useful to only run the crawling and not run warc2zim conversion (which might be known to fail, or even hang forever in a dead loop).

We should add a --crawl-only CLI argument to support this scenario (and integrate this in the Zimfarm obviously).
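For illustration only, here is a minimal sketch of how such a flag could gate the conversion step. This is not the actual zimit code: the trimmed-down parser, the `--url` option, the `plan_steps` helper and the step names are all hypothetical and only serve to show the intended behavior.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical, trimmed-down parser; the real CLI has many more options.
    parser = argparse.ArgumentParser(prog="zimit-sketch")
    parser.add_argument("--url", required=True, help="Website to crawl")
    parser.add_argument(
        "--crawl-only",
        action="store_true",
        help="Stop after the crawl and skip the warc2zim conversion",
    )
    return parser


def plan_steps(argv: list[str]) -> list[str]:
    """Return the ordered steps a run would execute for the given arguments."""
    args = build_parser().parse_args(argv)
    steps = ["crawl"]
    # Assumption: the conversion is simply skipped when the flag is set,
    # leaving the WARC files on disk for later inspection.
    if not args.crawl_only:
        steps.append("warc2zim")
    return steps


if __name__ == "__main__":
    print(plan_steps(["--url", "https://example.com", "--crawl-only"]))
    print(plan_steps(["--url", "https://example.com"]))
```

With `--crawl-only` the sketch plans only the crawl step; without it, the conversion step follows the crawl as usual.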

rgaudin · 2024-05-13T11:21:25Z

PR looks good but I'm not sure about the operational value of this

> which might be known to fail

How is that a problem if we can keep the WARCs (and even upload them via Zimfarm)? warc2zim is a speedy process

> or even hang forever in a dead loop

Is this an existing problem? Where's the ticket about this? openzim/warc2zim#132 is zimit1 AFAIK

benoit74 · 2024-05-13T11:47:01Z

> How is that a problem if we can keep the WARCs (and even upload them via Zimfarm)? warc2zim is a speedy process

Speedy compared to the crawl, yes. But still time consuming for probably nothing.

> Is this an existing problem?

It happened to me during development on my machine, and I don't see why it would not happen in production. I most probably hit a case this morning, but I have not yet had time to collect material to open the corresponding issue.

benoit74 · 2024-05-13T11:52:13Z

openzim/warc2zim#246

benoit74 · 2024-05-13T11:53:51Z

But you're right that from an operational point of view, it would make more sense to adapt the Zimfarm worker so that logs and artifacts are uploaded even when a cancellation is requested. It would probably help in multiple cases.

rgaudin · 2024-05-13T11:55:25Z

Yes, it's a huge frustration point for most scrapers

benoit74 · 2024-05-13T13:33:28Z

Then, let's close this in favor of openzim/zimfarm#965

benoit74 added the enhancement label May 13, 2024

benoit74 self-assigned this May 13, 2024

benoit74 mentioned this issue May 13, 2024

Add support for only crawling the website, not calling warc2zim #298

Closed

benoit74 closed this as not planned May 13, 2024
