Blacklist requests that are duplicates of existing resources or bound to fail #28

Popolechien · 2022-03-02T10:15:01Z

Following openzim/zimit#113, we should think about implementing an easily editable list (hosted on drive.kiwix.org?) of blacklisted sites that cannot be requested on zimit, e.g.

kiwix.org subdomains (download and library);
very large corporate websites (e.g. Facebook, Twitter, Reddit, YouTube, etc.);
websites that have been scraped in the past and failed.

It's probably a matter for a separate ticket, but requests for websites we already have a scraper for (Wikipedia, Stack Overflow, etc.) should also be soft-blocked, with the user offered a direct link to the zim file.
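
To make the proposal concrete, here is a minimal sketch of how such a check could look, assuming the list is loaded into two in-memory structures. Everything in it is hypothetical: the names `HARD_BLOCKED`, `SOFT_BLOCKED`, `Verdict` and `check_request`, the sample entries, and the example.org links are placeholders, not anything that exists in zimit today.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlsplit

# Hypothetical entries: sites that can not be requested at all.
HARD_BLOCKED = {
    "download.kiwix.org",
    "library.kiwix.org",
    "facebook.com",
    "twitter.com",
    "reddit.com",
    "youtube.com",
}

# Hypothetical entries: sites we already have a scraper for, mapped to a
# placeholder link where the existing zim file could be offered instead.
SOFT_BLOCKED = {
    "wikipedia.org": "https://example.org/wikipedia.zim",
    "stackoverflow.com": "https://example.org/stackoverflow.zim",
}


@dataclass
class Verdict:
    allowed: bool
    reason: str = ""
    existing_zim: Optional[str] = None


def _matches(host: str, domain: str) -> bool:
    # Match the domain itself or any of its subdomains, but not
    # look-alikes such as "notfacebook.com".
    return host == domain or host.endswith("." + domain)


def check_request(url: str) -> Verdict:
    """Decide whether a requested URL should be accepted."""
    host = (urlsplit(url).hostname or "").lower()
    if not host:
        return Verdict(False, "invalid URL")
    # Soft block: refuse, but point the user to the existing zim file.
    for domain, link in SOFT_BLOCKED.items():
        if _matches(host, domain):
            return Verdict(False, "already available", link)
    # Hard block: refuse outright.
    for domain in HARD_BLOCKED:
        if _matches(host, domain):
            return Verdict(False, "blacklisted")
    return Verdict(True)
```

With these sample entries, `check_request("https://en.wikipedia.org/wiki/Kiwix")` would be refused with a link in `existing_zim`, while a site on neither list would be allowed through. Sites that failed in the past could simply be added to the hard-blocked set.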

rgaudin · 2022-03-02T10:16:58Z

Can you move your comment to #25 and close this? This is the scraper's repo.

Popolechien · 2022-03-02T10:20:10Z

@rgaudin Moved it but I'd keep it open as this ticket is a little bit different.

rgaudin · 2022-03-02T10:21:50Z

This one's better; closing the other one, but the problem raised there remains: where do we point to for stuff that we know exists?

Popolechien · 2022-03-03T16:27:09Z

Is your question "in case there are several versions of the same zim" (e.g., Wikipedia mini/nopic/maxi)?

The basic assumption here is that zimit provides a copy of the real thing, so we should send them the maxi zim file.

stale · 2022-05-03T01:50:18Z

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

kelson42 · 2023-11-04T17:00:54Z
