How to scrape large websites in a reasonable manner #333

Open
benoit74 opened this issue Jul 1, 2024 · 0 comments
Scraping large websites (millions of pages) is challenging because:

  • since the scrape takes a long time to complete, the chance that the website changes during the crawl is significant:
    • this can cause small issues like some pages being missing or outdated compared to the rest of the corpus
    • this can cause more serious issues like broken links due to some pages being moved during the crawl
  • since the scrape takes a long time to complete, it is complex to run on the Zimfarm

One example of such a website is https://forums.gentoo.org/ where it looks like we have between 1 and 6 million pages to crawl. See openzim/zim-requests#1057

Most pages are however static, i.e. they rarely change from one crawl to the next, so some caching could definitely help, but I have no idea how we could implement it. A rough sketch of one possible direction is below.
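Just to make the idea concrete, here is a minimal sketch of what a validator-based cache could look like, assuming the forum actually returns usable `ETag`/`Last-Modified` headers (to be verified) and leaving aside how this would be wired into the crawler itself. All names (`CACHE_PATH`, `fetch_with_cache`) are illustrative only, not part of any existing scraper.

```python
# Hypothetical sketch: reuse HTTP validators (ETag / Last-Modified) so that
# unchanged pages do not need to be re-downloaded on a subsequent crawl.
import json
from pathlib import Path

import requests

# Illustrative on-disk store mapping URL -> stored validators.
CACHE_PATH = Path("validator-cache.json")


def load_cache() -> dict:
    if CACHE_PATH.exists():
        return json.loads(CACHE_PATH.read_text())
    return {}


def fetch_with_cache(url: str, cache: dict) -> bytes | None:
    """Return the page body, or None if the server reports it unchanged."""
    headers = {}
    entry = cache.get(url, {})
    if "etag" in entry:
        headers["If-None-Match"] = entry["etag"]
    if "last_modified" in entry:
        headers["If-Modified-Since"] = entry["last_modified"]

    response = requests.get(url, headers=headers, timeout=30)
    if response.status_code == 304:
        # Unchanged since the last crawl: the previously stored content can be reused.
        return None

    response.raise_for_status()
    entry = {}
    if "ETag" in response.headers:
        entry["etag"] = response.headers["ETag"]
    if "Last-Modified" in response.headers:
        entry["last_modified"] = response.headers["Last-Modified"]
    cache[url] = entry
    return response.content


if __name__ == "__main__":
    cache = load_cache()
    body = fetch_with_cache("https://forums.gentoo.org/", cache)
    print("changed" if body is not None else "unchanged since last crawl")
    CACHE_PATH.write_text(json.dumps(cache))
```

This only helps if the server emits stable validators for static pages; if it does not, a content-hash comparison of re-fetched pages would be the fallback, which still costs the download but avoids rewriting unchanged entries.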

For now, I don't know how we can crawl such big sites in a reasonable manner.
