Special handling for known websites (WP, youtube, ted, etc) #33

Open
Popolechien opened this issue Sep 28, 2021 · 8 comments
Labels
enhancement New feature or request question Further information is requested stale

Comments

@Popolechien
Contributor

I see that almost every day (and certainly several times a week) people are running requests for Wikipedia, Wikibooks or even YouTube.
Zimit should be able to either a) switch gears and run the corresponding scraper (YouTube), or b) directly offer the latest ZIM available (Wikipedia, Wikibooks).

@Popolechien Popolechien added the enhancement New feature or request label Sep 28, 2021
@rgaudin
Member

rgaudin commented Sep 28, 2021

No, we discussed that a while back and apparently did not create a ticket, but the idea was to have a list of known websites for which we refuse requests and display a message explaining where to find already existing ZIMs.

Switching scrapers is not practical for many reasons, mainly because we have no limits on those other scrapers.
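
For illustration only, the "list of known websites" idea could look roughly like the sketch below. The `KNOWN_WEBSITES` mapping, the `refusal_message` helper and the target URLs are assumptions made up for this example; nothing here is existing Zimit code or an agreed list of domains.

```python
from typing import Optional
from urllib.parse import urlsplit

# Hypothetical mapping: hostname suffix -> where existing ZIMs can be found.
# Entries and target URLs are illustrative assumptions, not an agreed list.
KNOWN_WEBSITES = {
    "wikipedia.org": "https://download.kiwix.org/zim/wikipedia/",
    "wikibooks.org": "https://download.kiwix.org/zim/wikibooks/",
}


def refusal_message(url: str) -> Optional[str]:
    """Return a refusal message if the URL targets a known website, else None."""
    # URLs without a scheme have no parsed hostname and are never matched here.
    host = (urlsplit(url).hostname or "").lower()
    for suffix, location in KNOWN_WEBSITES.items():
        # Match the domain itself or any of its subdomains (en.wikipedia.org, ...).
        if host == suffix or host.endswith("." + suffix):
            return (
                f"Requests for {host} are not accepted: "
                f"ready-made ZIMs are available at {location}"
            )
    return None
```

A request handler would call `refusal_message(url)` before queuing the job and show the returned text instead of accepting the request.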

@rgaudin rgaudin changed the title Zimit should identify jobs for which another scraper (or zim) is available Special handling for known websites (WP, youtube, ted, etc) Sep 28, 2021
@Popolechien
Contributor Author

display a message explaining where to find already existing ZIMs.

Sounds good to me, and that was the main point, but then the response message should identify the target and the corresponding ZIM (e.g. "here is the link to en.wikipedia.org's latest ZIM available" and not "go to download.kiwix.org/zim and figure it out").

@rgaudin
Member

rgaudin commented Sep 28, 2021

Ideally, yes. It can probably be implemented in two steps so that this gets a chance to be done.

At first, we can redirect to the Wiki where files are listed. Or maybe the library with the new kiwix-serve is considered easy enough?

First thing you can do is list the domains and where to point to. It's easy for those we have a category for.
YouTube will require special treatment anyway, as we don't have ready-made ZIMs for all. I see two options:

  • we keep it as it is, but add a message on request saying this is probably not what they want, linking to both the scraper and the contact form to request a custom ZIM.
  • or we block the request and show a similar message
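
The two options differ only in whether the request still runs, so they could share one code path. The sketch below is an assumption-laden illustration: `Action`, `Decision`, `decide` and the message wording are hypothetical names, and the real message would carry links to the scraper and the contact form, whose URLs are not given in this thread.

```python
from dataclasses import dataclass
from enum import Enum
from urllib.parse import urlsplit


class Action(Enum):
    ACCEPT = "accept"  # run the request as usual
    WARN = "warn"      # first option: keep the request, but show a message
    BLOCK = "block"    # second option: refuse the request and show the message


@dataclass
class Decision:
    action: Action
    message: str = ""


# Hypothetical wording; links to the scraper and contact form are omitted.
YOUTUBE_MESSAGE = (
    "This is probably not what you want: see the dedicated YouTube scraper, "
    "or use the contact form to request a custom ZIM."
)


def decide(url: str, youtube_action: Action = Action.WARN) -> Decision:
    """Return what to do with a request; YouTube gets the configured action."""
    host = (urlsplit(url).hostname or "").lower()
    if host == "youtube.com" or host.endswith(".youtube.com"):
        return Decision(youtube_action, YOUTUBE_MESSAGE)
    return Decision(Action.ACCEPT)
```

Switching from the first option to the second would then be a one-argument change (`youtube_action=Action.BLOCK`) rather than a separate implementation.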

@Popolechien
Contributor Author

Or maybe the library with the new kiwix-serve is considered easy enough?

This would have my preference by far, but when I look at domains, based on the past three months (and this doc), I think we can simply send them to wikipedia_en_all.zim.

@kelson42
Contributor

kelson42 commented Sep 30, 2021

We could have a ZIM metadata "source_url" and then allow library.kiwix.org to filter on it?

@kelson42 kelson42 added the question Further information is requested label Sep 30, 2021
@rgaudin
Member

rgaudin commented Sep 30, 2021

We could have a ZIM metadata "source_url" and then allow library.kiwix.org to filter on it?

Yes, that's an interesting feature for which the default behavior might be tricky: how much matching do you want? Domain? Netloc? Path? Scheme? But yeah, that would be best for us.
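
To make the "how much matching" question concrete, here is a small sketch that compares a requested URL with a ZIM's hypothetical "source_url" metadata at each of those levels. The helper names are made up for this example, and the "domain" comparison is deliberately naive (last two hostname labels), so it would be wrong for suffixes such as co.uk.

```python
from typing import Dict
from urllib.parse import urlsplit


def naive_domain(host: str) -> str:
    """Keep the last two labels of a hostname (naive; wrong for e.g. co.uk)."""
    return ".".join(host.split(".")[-2:])


def match_report(requested: str, source_url: str) -> Dict[str, bool]:
    """Report which URL components of the request equal those of source_url."""
    req, src = urlsplit(requested), urlsplit(source_url)
    req_host = (req.hostname or "").lower()
    src_host = (src.hostname or "").lower()
    return {
        "scheme": req.scheme == src.scheme,
        "netloc": req.netloc.lower() == src.netloc.lower(),
        "domain": naive_domain(req_host) == naive_domain(src_host),
        # Paths are compared exactly, ignoring a trailing slash.
        "path": req.path.rstrip("/") == src.path.rstrip("/"),
    }
```

For instance, `match_report("https://fr.wikipedia.org/", "https://en.wikipedia.org/")` matches on scheme, domain and path but not on netloc, which is exactly the kind of case a default would have to decide on.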

@stale

stale bot commented Mar 2, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

@stale stale bot added the stale label Mar 2, 2022
@rgaudin rgaudin transferred this issue from openzim/zimit Feb 1, 2023
@stale stale bot removed stale labels Feb 1, 2023
@stale

stale bot commented May 26, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.


3 participants