Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add SOCKS Proxy support #682

Merged
merged 5 commits into from
Aug 28, 2024
Merged

Conversation

gremid
Copy link
Contributor

@gremid gremid commented Aug 27, 2024

This PR adds SOCKS proxy support in trafilatura.downloads. It works for requests based on urllib3 as well as pycurl, is configured via the environment variable http_proxy and tested with SOCKS5 proxies (with and without BASIC authentication).

@gremid
Copy link
Contributor Author

gremid commented Aug 27, 2024

@adbar Please let me know, if I can help with documenting the feature.

@gremid gremid force-pushed the feature/socks-proxy-support branch from 9433a17 to 97332c6 Compare August 27, 2024 11:39
Copy link

codecov bot commented Aug 27, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 98.54%. Comparing base (14c79c0) to head (fa1f91a).
Report is 1 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master     #682   +/-   ##
=======================================
  Coverage   98.54%   98.54%           
=======================================
  Files          21       21           
  Lines        3505     3517   +12     
=======================================
+ Hits         3454     3466   +12     
  Misses         51       51           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@adbar
Copy link
Owner

adbar commented Aug 27, 2024

@gremid Could you do something about test coverage? We can see the docs later.

@gremid
Copy link
Contributor Author

gremid commented Aug 27, 2024

I moved the proxy-related tests from the GitHub Action setup to a parametrized pytest fixture. Thus, the coverage of both code paths in the HTTP fetch routines, with and without proxy setup, should be recorded now. The import and parsing of the environment variable http_proxy at the top of trafilatura.downloads still is not covered.

@gremid gremid force-pushed the feature/socks-proxy-support branch from a732852 to b740098 Compare August 28, 2024 06:42
@gremid
Copy link
Contributor Author

gremid commented Aug 28, 2024

Wrestling with GitHub Actions and environment settings … Now the variable PROXY_TEST, defined in the matrix context, should also be available in the test run.

@adbar
Copy link
Owner

adbar commented Aug 28, 2024

Thanks!

@adbar adbar merged commit b3aea4a into adbar:master Aug 28, 2024
15 checks passed
@andremacola
Copy link
Contributor

andremacola commented Sep 7, 2024

I haven't had time to look at Trafilatura for a while, but this PR caught my attention because in the past I had made a patch to add proxies as well. https:/adbar/trafilatura/pull/332/files

@gremid Congratulations on the tests; at the time, I couldn't give it the attention it deserved (sorry @adbar)

A few considerations:

  • In my view, the proxy shouldn't be used for every request. In this PR, by setting the http_proxy environment variable, Trafilatura assumes it's to be used for everything.
  • There are use cases where the proxy is only necessary for certain sites. In my opinion, this should be set at the time of the request by passing a boolean value (perhaps in trafilatura.fetch_url()).
  • Another situation is the need for alternating proxies between requests, and an environment variable applied in the current logic limits this approach.

In my PR, you might find some ideas to further improve this functionality: https:/adbar/trafilatura/pull/332/files

These are just suggestions. Again, congratulations on the PR.

@adbar
Copy link
Owner

adbar commented Sep 9, 2024

Thanks for the feedback. We won't work on this right now but I added your idea to the list (#697).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants