How to scrape large websites in a reasonable manner #333

Open
benoit74 opened this issue Jul 1, 2024 · 0 comments
Scraping large websites (millions of pages) is challenging because:

  • since the scrape takes a long time to complete, the chance that the website changes during the crawl is significant:
    • this can cause small issues like some pages being missing or outdated compared to the rest of the corpus
    • this can cause more serious issues like broken links due to some pages being moved during the crawl
  • since the scrape takes a long time to complete, it is complex to run on the Zimfarm

One example of such a website is https://forums.gentoo.org/ where it looks like we have between 1 and 6 million pages to crawl. See openzim/zim-requests#1057

Most pages are however static, i.e. they rarely change from one crawl to the next, so some caching could definitely help, but I have no idea how we could implement it. A rough sketch of one possible direction is below.
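Just to make the idea concrete, here is a minimal sketch of what a validator-based cache could look like, assuming the forum actually returns usable `ETag`/`Last-Modified` headers (to be verified) and leaving aside how this would be wired into the crawler itself. All names (`CACHE_PATH`, `fetch_with_cache`) are illustrative only, not part of any existing scraper.

```python
# Hypothetical sketch: reuse HTTP validators (ETag / Last-Modified) so that
# unchanged pages do not need to be re-downloaded on a subsequent crawl.
import json
from pathlib import Path

import requests

# Illustrative on-disk store mapping URL -> stored validators.
CACHE_PATH = Path("validator-cache.json")


def load_cache() -> dict:
    if CACHE_PATH.exists():
        return json.loads(CACHE_PATH.read_text())
    return {}


def fetch_with_cache(url: str, cache: dict) -> bytes | None:
    """Return the page body, or None if the server reports it unchanged."""
    headers = {}
    entry = cache.get(url, {})
    if "etag" in entry:
        headers["If-None-Match"] = entry["etag"]
    if "last_modified" in entry:
        headers["If-Modified-Since"] = entry["last_modified"]

    response = requests.get(url, headers=headers, timeout=30)
    if response.status_code == 304:
        # Unchanged since the last crawl: the previously stored content can be reused.
        return None

    response.raise_for_status()
    entry = {}
    if "ETag" in response.headers:
        entry["etag"] = response.headers["ETag"]
    if "Last-Modified" in response.headers:
        entry["last_modified"] = response.headers["Last-Modified"]
    cache[url] = entry
    return response.content


if __name__ == "__main__":
    cache = load_cache()
    body = fetch_with_cache("https://forums.gentoo.org/", cache)
    print("changed" if body is not None else "unchanged since last crawl")
    CACHE_PATH.write_text(json.dumps(cache))
```

This only helps if the server emits stable validators for static pages; if it does not, a content-hash comparison of re-fetched pages would be the fallback, which still costs the download but avoids rewriting unchanged entries.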

For now, I don't know how we can crawl such big sites in a reasonable manner.
