Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Double slash in query string (and path) are not working properly #316

Closed
benoit74 opened this issue Jun 17, 2024 · 2 comments · Fixed by #365
Closed

Double slash in query string (and path) are not working properly #316

benoit74 opened this issue Jun 17, 2024 · 2 comments · Fixed by #365
Assignees
Labels
bug Something isn't working
Milestone

Comments

@benoit74
Copy link
Collaborator

URL rewriting is not working properly when we have double / in query string.

E.g. if HTML document URL is https://www.example.com/index.html?url=https://www.example.com/foo, then everything is broken on the page in Kiwix-server due to improper URL rewriting / conflicting browser behavior / ...

The problem is that for such an original HTML document URL, the ZIM Path is www.example.com/index.html?url=https://www.example.com/foo and the URL to load this in kiwix-serve is http://yourserver/content/yourzim/www.example.com/index.html%3Furl%3Dhttps%3A//www.example.com/foo

From a browser/server perspective, the query string is dropped (this is normal for proper operation in all readers) and we hence end-up with a double slash //.

This is not a promising URL because how relative URLs are computed given this double slash is obviously not standardised. Or at least Python (used to compute relative path in scraper) and Firefox (used to display the ZIM) have different PoV on this. And we can be sure we will have very varied ways to interpret this URL and relative URLS inside the document in our various readers.

I hence recommend that when normalizing URLs into a ZIM Path, we get rid of any // (or /// and so on) and simplify them to /. This comes with a minimal risk of collision we can most probably assume is negligible. It would also break some websites which uses the query string value in JS code and really need the proper value. We'll see how often this happens.

Nota: An alternative could have been to keep same ZIM Path and change the way we transform ZIM Path into URL by saying that a ZIM path like foo//bar should lead to a URL foo/%2Fbar. This would ensure we keep the same look to URLs displayed in browser bar (%2F will continue to be displayed as /) and we minimize the risk of collisions (we keep same number of characters). The big problem with this approach is that since the conversion can only be done when we transform the Path into a URL (we cannot change the ZIM Path, encoded chars should not be present inside the ZIM Path), then it means that all computation of relative links which are based on the ZIM Path are not working anymore and also have to be adapted ... which seems overly complex.

Nota: this bug has been observed in the wild on https://edu.gcfglobal.org/pt/criar-um-correio-eletronico/como-funciona-um-correio-eletronico/1/# where we have an iFrame to https://support.gcfglobal.org/form/?url=https://edu.gcfglobal.org/pt/criar-um-correio-eletronico/como-funciona-um-correio-eletronico/1/

@benoit74 benoit74 added the bug Something isn't working label Jun 17, 2024
@benoit74
Copy link
Collaborator Author

Nota: real world example of edu.gcfglobals.org tend to prove my suggestion might not be the best fit: iFrame is anyway useless (should be hidden / removed) + result is not going to work anyway since URL has been broken. Maybe we should simply use a fuzzy rule to simplify path and not store too many items for nothing.

@benoit74 benoit74 self-assigned this Jul 24, 2024
@benoit74 benoit74 added this to the 2.1.0 milestone Jul 24, 2024
@benoit74
Copy link
Collaborator Author

Yet another occurrence at https://womenshistory.si.edu/blog/gold-standard-how-these-iconic-olympic-athletes-inspired-and-united-us

Here we have an iFrame to for instance https://womenshistory.si.edu/media/oembed?url=https://www.youtube.com/watch?v=PcQ9QeWiQEY&max_width=1280&max_height=720&hash=BnsRjBKr7HzRSJEc4lZOqZ6UCNAtlDgm8OMMZbjaI1g

Properly rewrite the // to / would here solve this website issue. So I propose to move this issue forward with the change I've proposed in earlier comment.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant