Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dont leak ip address #310

Merged
merged 4 commits into from
Sep 18, 2023
Merged

Dont leak ip address #310

merged 4 commits into from
Sep 18, 2023

Conversation

drzraf
Copy link
Contributor

@drzraf drzraf commented Sep 17, 2023

Long needed, long requested feature to stop leaking contributors/uploaders IP addresses.
#145, #146, #192

Here finally implemented by carefully parsing the URL as suggested previously.

  • Includes unit tests
  • Only replace the JSON file on-disk if actually necessary

This reverts commit 0911477.

urllib3 is needed to postprocess/parse/URL for sanitization and privacy purpose (bibanon#192)
We fix this by carefully redacting the IP address in the JSON fields known to contain it.
@brandongalbraith
Copy link
Collaborator

Not all heroes wear capes. Will review tomorrow and if everything looks good, we'll ship at as soon as possible. Thanks for putting the time into this.

@vxbinaca
Copy link
Collaborator

vxbinaca commented Sep 17, 2023

flake8 error.

Do you have example test upload from this branch?

edit:

Do we need to lock that dep version? Can that be lifted?

Copy link
Collaborator

@vxbinaca vxbinaca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this dep version locked?

@vxbinaca
Copy link
Collaborator

I have questions @drzraf

  • Do we need the urllib dep locked to that version?
  • Can you show me a example upload where the IP address is redacted?
  • Can you show this doesn't trigger false positives?

When you answer these three questions I'll merge.

@drzraf
Copy link
Contributor Author

drzraf commented Sep 18, 2023

* Do we need the urllib dep locked to that version?

No, I just reverted a previous commit for sake of commit history/revision tracking/reviewing process. I can remove the version lock if you prefer so.

* Can you show me a example upload where the IP address is redacted?

It's not clear to me what do you want and why. A modified file? It easy to generate one:

#!/usr/bin/python3
import json, sys
from tubeup.utils import strip_ip_from_meta

with open(sys.argv[1], 'r') as f:
    _, new_meta = strip_ip_from_meta(json.load(f))
    print(json.dumps(new_meta))
# Run with test.py tests/test_tubeup_rootdir/downloads/Mountain_3_-_Video_Background_HD_1080p-6iRV8liah8A.info.json

I'm attaching a modified file if that's what you meant: a.json.gz

* Can you show this doesn't trigger false positives?

It's safe because:

  • Transformations only apply to 4 specific fields, all of them are URLs
  • For the query-string: URL are parsed with a dedicated standard library rather than regexp, no error is possible
  • For the path segment, it's processed separately from the rest of the URL limiting the scope
  • I assume path segment processing regexp to be safe unless told otherwise: r'/ip/[^/]+', r'/ip/REDACTED'
  • /ip/ is very specific (and I've empirically looked for it among hundreds of info files) and following the observed structure of Youtube URL, is always followed by a (non-empty) IP address.
  • The non-greedy segment match is also safe
  • TBH I've not looked at all the 1068 possible websites supported by youtube-dl to check if any of them would return an URL containing /ip/ followed by something which is not an IP address (or using ip= parameter for some other purpose), nor have I resources and will to do so. The underlying idea is to fix an actual grave concern more than overthink an improbable rare inoffensive future one.

Let's keep in mind that, overall, these files contains metadata related to the downloaded video. Some are very useful, other are garbage. In the case of these URL, then often contain expiring tokens/query parameters making them unreachable after a specific amount of time: they are in no way stable or canonical resources/identifier (some suggested to even drop them entirely). BTW, Google could still identify uploaders based on the temporary tokens parameters which this PR doesn't currently redact.

tl;dr : These are mostly unimportant metadata but reasonable efforts have been provided to ensure safely obfuscation so that no side-effect is expected on Youtube and there is very very low probability a significant side-effect happened to the URLs from another provider.

@vxbinaca vxbinaca merged commit 9c2ae72 into bibanon:master Sep 18, 2023
5 checks passed
@vxbinaca
Copy link
Collaborator

vxbinaca commented Sep 18, 2023

BTW, Google could still identify uploaders based on the temporary tokens parameters which this PR doesn't currently redact.

Useless because most people upload from Linux where they don't normally log in, or use a separate cookie file.

Edit:

Thanks for this I'll cut a new version later.

@vxbinaca
Copy link
Collaborator

It's not clear to me what do you want and why.

I was asking for a link to a example upload where you tested this PR. Beyond just the test. I wanted to inspect the results of what it does.

@vxbinaca
Copy link
Collaborator

Your PR broke Tubeup. Expect a hard reset on the repo, and extreme scrutiny and testing standards for your PRs in the future.

@drzraf
Copy link
Contributor Author

drzraf commented Sep 22, 2023

Would you mind explaining what did it broke?
NB: I have diff-ed info json files before/after processing, and visually compared them in order to ensure adequate replacements.

@vxbinaca
Copy link
Collaborator

Would you mind explaining what did it broke? NB: I have diff-ed info json files before/after processing, and visually compared them in order to ensure adequate replacements.

This is a lot of effort to not show me a URL.

That PR put me through so much work, I tried to lift then re-added the urllib3 version lock, pushed new versions to pypi. I had to yank 3 releases due to that PR then hard reset Github I couldn't revert your PR due to me doing a bunch of other work after it, which is now all gone.

Personally I'm not seeing a security issue from the IP in a VOD URL. Use a VPS.

@drzraf
Copy link
Contributor Author

drzraf commented Sep 25, 2023

  1. show me a URL, while syntactically correct, makes no sense. URL shown: "https://google.com". Could you please be more explicit?
  2. You didn't, by any mean, explained what was wrong with the current PR. What did it actually broke?
  3. You explained why doing a force-push is painful. (Well, nothing stopped you from just reverting, but reasons are yours)
  4. Leaking IP addresses always been and will always be a problem (That's the reason we haven't yet encountered someone disclosing it on his GitHub profile)

My main question remains : what's wrong with the PR?
By answering, it could be improved, tested, and finally merged.

I obviously assume good faith. I also assume only technical implementations details are at stake here (unlike in #311).
If, for any reason, you wished submitter's IP addresses to end up published on archive.org, then it'd be definitely preferible to state it formally so that the discussion could be elegantly and quickly closed.

Otherwise, I suggest to reopen the PR and offer details about the problems it may have cause, so that I revamp and improve it.

thx

@mrpapersonic
Copy link
Collaborator

mrpapersonic commented Sep 25, 2023

  1. show me a URL, while syntactically correct, makes no sense. URL shown: "https://google.com". Could you please be more explicit?

He just wants the URL of a successful upload with this patch. I'd assume your code is right, but only works for YouTube and breaks other sites.

@vxbinaca
Copy link
Collaborator

No it broke everything.

@vxbinaca
Copy link
Collaborator

@mrpapersonic If you want this PR in, I want it verified it works. So since the author cannot just give me an example URL, if you want this in make sure it works first. Because me and #313 saw how it was broken

@vxbinaca
Copy link
Collaborator

@drzraf

Hi, you haven't enabled the ability to make issues on your fork so I'm doing this here.

Every time a item is uploaded to Archive.org the "Scanner" tag on the item lists the script name and version:

Scanner:    TubeUp Video Stream Mirroring Application 2023.08.19 

BinAnon is requesting you change this, you can make it "tubeup-drzraf" or whatever you want, but not our Scanner ID. Your fork divirges from ours in function significantly, specifically with open yt-dlp flags that allow for non-standard output and we don't want staff or future historians confusing the two. This is for automation of fixing items by IA staff.

Additionally you list a link to our PyPi repo. Please also change that if you're going to fork.

Please change this ASAP.

Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants