Remove WayBack Machine bits from downloaded files #181

Open
smontanaro opened this issue Nov 29, 2022 · 2 comments
Labels
enhancement New feature or request

Comments

@smontanaro

I'm just about done downloading a defunct website from the Wayback Machine. waybackpy has been quite helpful. The website is/was a small reference site. We think the owner passed away, so I'm trying to reconstruct it.

Initially, I thought the HTML files just had a bit of JS in the header and a footer identifying the capture times and dates. Looking deeper, it seems there are quite a few more strands of Wayback Machine bits embedded in the files. (Even some "JPG" files are actually bits of WM JavaScript.) I'm finessing any copyright issues for the moment (searching the Internet Archive doesn't turn up much, mostly material about copyright on movies and books).

Are any tools available for cleaning up the downloaded files? Note that I don't expect waybackpy to be modified to perform this function. I've come up empty in my search though, so I thought maybe people here might have some pointers.

smontanaro added the enhancement (New feature or request) label Nov 29, 2022
@akamhy
Owner

akamhy commented Nov 30, 2022

@smontanaro Try reading http://web.archive.org/web/20110812221535id_/http://faq.web.archive.org/page-without-wayback-code/

The id_ suffix in 20110812221535id_ does the trick. I plan to add an option for this to the library in the future, but right now I don't have enough time.

Just append id_ after the timestamp in the URI.
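
For anyone applying this to a batch of already-collected snapshot URLs, here is a minimal sketch of the rewrite in Python. The function name to_raw_snapshot_url is hypothetical and is not part of waybackpy; the sketch assumes the snapshot URL has the usual /web/<timestamp>/ segment, and it only rewrites the string without making any network request.

```python
import re

# Matches the "/web/<timestamp>" segment of a Wayback Machine snapshot URL,
# optionally followed by an existing two-letter modifier such as "id_" or "im_".
# Assumption: timestamps are 1 to 14 digits (YYYYMMDDhhmmss, possibly truncated).
_SNAPSHOT_RE = re.compile(r"(/web/\d{1,14})(?:[a-z]{2}_)?/")


def to_raw_snapshot_url(archive_url: str) -> str:
    """Return archive_url with "id_" appended to its timestamp.

    Hypothetical helper, not part of waybackpy.
    """
    # Replace only the first match so that a "/web/<digits>/" segment inside
    # the archived site's own URL is left alone.
    new_url, count = _SNAPSHOT_RE.subn(r"\1id_/", archive_url, count=1)
    if count == 0:
        raise ValueError(f"not a Wayback Machine snapshot URL: {archive_url!r}")
    return new_url


if __name__ == "__main__":
    print(to_raw_snapshot_url(
        "http://web.archive.org/web/20110812221535/"
        "http://faq.web.archive.org/page-without-wayback-code/"
    ))
```

The rewrite is idempotent: a URL that already carries id_ comes back unchanged, so it should be safe to run over a mixed list of URLs.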

@akamhy
Owner

akamhy commented Nov 30, 2022

Also, please don't close this issue; it would be a cool feature for this library. I will implement an option that changes the URI so that it fetches the original source of web pages.
