Add option to only crawl website and not run warc2zim conversion #297
Comments
PR looks good but I'm not sure about the operational value of this
How is that a problem if we can keep the WARCs (and even upload them via Zimfarm)? warc2zim is a speedy process
Is this an existing problem? Where's the ticket about this? openzim/warc2zim#132 is zimit1 AFAIK
Speedy compared to the crawl, yes. But still time-consuming, probably for nothing.
It happened to me during development on my machine, so I don't see why it would not happen in production. I most probably have a case this morning, but I have not yet had time to collect material to open the corresponding issue.
But you're right that from an operational point of view, it would make more sense to adapt the Zimfarm worker so that logs and artifacts are uploaded even when a cancellation is requested. It would probably help in multiple cases.
Yes, it's a huge frustration point for most scrapers
Then, let's close this in favor of openzim/zimfarm#965
For debugging purposes, it might be useful to run only the crawl and skip the warc2zim conversion (which might be known to fail, or even hang forever in an infinite loop).
We should add a `--crawl-only` CLI argument to support this scenario (and integrate this in the Zimfarm obviously).
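For illustration, the proposed flag could be wired roughly as follows. This is only a minimal sketch using Python's standard `argparse`; it is not zimit's actual argument parser, and `build_parser` and `plan_steps` are hypothetical names introduced here. The step names `"crawl"` and `"warc2zim"` are placeholders for the real stages.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser: only the flag proposed in this issue is modelled.
    parser = argparse.ArgumentParser(
        description="Sketch of a crawl-only switch (not zimit's real CLI)"
    )
    parser.add_argument(
        "--crawl-only",
        action="store_true",
        help="run the crawl and keep the WARCs, but skip the warc2zim conversion",
    )
    return parser


def plan_steps(argv: list[str]) -> list[str]:
    """Return the stages that would run for the given command-line arguments."""
    args = build_parser().parse_args(argv)
    steps = ["crawl"]
    # The conversion stage is skipped only when --crawl-only was passed.
    if not args.crawl_only:
        steps.append("warc2zim")
    return steps
```

With this shape, the default behaviour is unchanged and the conversion is skipped only when the flag is given explicitly, which keeps the option safe to ignore for regular runs.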