Proxy changes for same session #2503

Open
1 task
harm-matthias-harms opened this issue May 27, 2024 · 1 comment
Labels
bug Something isn't working. t-tooling Issues with this label are in the ownership of the tooling team.


harm-matthias-harms commented May 27, 2024

Which package is this bug report for? If unsure which one to select, leave blank

@crawlee/browser (BrowserCrawler)

Issue description

According to the documentation, proxies and sessions are bound together, so that a session is not blocked for suddenly running under a different IP address. The documentation gives an example similar to my setup:

new PlaywrightCrawler({
  // ...
  useSessionPool: true,
  sessionPoolOptions: {
    sessionOptions: {
      maxUsageCount: 7,
      // ...
    },
  },
  proxyConfiguration: new ProxyConfiguration({ proxyUrls: proxyList() }), // list of 250 proxies: same IP, different ports
  // ...
})

But if I check the proxy and session in the router, the session's ID does not match the proxy's session ID:

log.info(session?.id)
log.info(proxyInfo?.port)
log.info(proxyInfo?.sessionId)

This outputs something like:

INFO  PlaywrightCrawler: session_AlZoomLhQU
INFO  PlaywrightCrawler: 10209
INFO  PlaywrightCrawler: session_Dnha2MhDeX
....
INFO  PlaywrightCrawler: session_AlZoomLhQU
INFO  PlaywrightCrawler: 10208
INFO  PlaywrightCrawler: session_6jOviCJSHt
...
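
The check above can also be automated. The following is a minimal sketch, not part of Crawlee: `Observation`, `findMismatches`, and `portsBySession` are hypothetical names, and the records are simply the three values logged in the handler, copied from the outputs in this report.

```typescript
// Hypothetical record of the three values logged in the handler (not a Crawlee type).
interface Observation {
    sessionId: string;      // session?.id
    proxyPort: number;      // proxyInfo?.port
    proxySessionId: string; // proxyInfo?.sessionId
}

// Returns the observations whose session ID differs from the proxy's session ID.
function findMismatches(observations: Observation[]): Observation[] {
    return observations.filter((o) => o.sessionId !== o.proxySessionId);
}

// Groups the observed proxy ports by session ID. A session that is bound to
// one proxy should end up with exactly one port in its set.
function portsBySession(observations: Observation[]): Map<string, Set<number>> {
    const result = new Map<string, Set<number>>();
    observations.forEach((o) => {
        const ports = result.get(o.sessionId) ?? new Set<number>();
        ports.add(o.proxyPort);
        result.set(o.sessionId, ports);
    });
    return result;
}

const observed: Observation[] = [
    { sessionId: 'session_AlZoomLhQU', proxyPort: 10209, proxySessionId: 'session_Dnha2MhDeX' },
    { sessionId: 'session_AlZoomLhQU', proxyPort: 10208, proxySessionId: 'session_6jOviCJSHt' },
    // A matching pair for contrast, as seen in the output after the local change.
    { sessionId: 'session_zBwqeH4a7N', proxyPort: 10204, proxySessionId: 'session_zBwqeH4a7N' },
];

console.log(findMismatches(observed).length); // 2
```

Grouping by session makes the actual symptom visible: the same session shows up on two different proxy ports.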

The problem seems to be that the proxy is loaded before the page context is enhanced, and enhancing the context can change the session.

A fix that works locally is to load the proxy only after the session has been loaded again. This can be done by moving the proxy-loading code block below the line mentioned above.

After the change the output looks like this:

INFO  PlaywrightCrawler: session_zBwqeH4a7N
INFO  PlaywrightCrawler: 10204
INFO  PlaywrightCrawler: session_zBwqeH4a7N
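
To illustrate why the order matters, here is a self-contained sketch of the two orderings. It is a simplified model under my own assumptions, not Crawlee's actual implementation: `Session`, `ProxyInfo`, `Context`, `proxyForSession`, `enhanceContext`, and both `prepare...` functions are hypothetical names, and the session swap is assumed to happen inside the enhancement step.

```typescript
// Simplified, hypothetical model of the ordering issue. None of these are Crawlee APIs.
interface Session { id: string }
interface ProxyInfo { sessionId: string; port: number }
interface Context { session: Session; proxyInfo?: ProxyInfo }

// Maps a session ID to one port deterministically, so the same session
// always receives the same proxy.
function proxyForSession(session: Session, ports: number[]): ProxyInfo {
    let hash = 0;
    for (let i = 0; i < session.id.length; i++) {
        hash = (hash * 31 + session.id.charCodeAt(i)) >>> 0;
    }
    return { sessionId: session.id, port: ports[hash % ports.length] };
}

// Stand-in for the enhancement step, assumed here to replace the session
// with the one tied to the (possibly reused) browser.
function enhanceContext(ctx: Context, browserSession: Session): void {
    ctx.session = browserSession;
}

// Ordering described in the report: proxy resolved first, session swapped afterwards.
function prepareProxyFirst(picked: Session, browserSession: Session, ports: number[]): Context {
    const ctx: Context = { session: picked };
    ctx.proxyInfo = proxyForSession(ctx.session, ports);
    enhanceContext(ctx, browserSession);
    return ctx;
}

// Reordered variant: session settled first, proxy resolved from the final session.
function prepareSessionFirst(picked: Session, browserSession: Session, ports: number[]): Context {
    const ctx: Context = { session: picked };
    enhanceContext(ctx, browserSession);
    ctx.proxyInfo = proxyForSession(ctx.session, ports);
    return ctx;
}

const examplePorts = [10204, 10208, 10209];
const buggy = prepareProxyFirst({ id: 'session_A' }, { id: 'session_B' }, examplePorts);
const fixed = prepareSessionFirst({ id: 'session_A' }, { id: 'session_B' }, examplePorts);

console.log(buggy.session.id === buggy.proxyInfo?.sessionId); // false
console.log(fixed.session.id === fixed.proxyInfo?.sessionId); // true
```

In this model the mismatch comes purely from resolving the proxy against a session that is later replaced. Whether the reordering has side effects elsewhere in the crawler is not covered by the sketch.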

I'm sorry for not providing a PR for this. I don't know whether the change has other implications, and it's not easy for me to add an adequate test quickly.

Related to https://discord.com/channels/801163717915574323/1243449005820874763

Code sample

No response

Package version

latest

Node.js version

20

Operating system

macOS

Apify platform

  • Tick me if you encountered this issue on the Apify platform

I have tested this on the next release

No response

Other context

No response

@harm-matthias-harms harm-matthias-harms added the bug Something isn't working. label May 27, 2024
@fnesveda fnesveda added the t-tooling Issues with this label are in the ownership of the tooling team. label May 29, 2024
Contributor

barjin commented Jun 12, 2024

Thank you @harm-matthias-harms for bringing this up.

Indeed, there is an issue with the way we're handling sessions in the browser crawlers. A running browser instance can be reused for multiple requests, but for technical reasons it always has only one proxy URL / session tied to it.

We'll try to straighten this out in upcoming patches. In the meantime, you can get the expected behavior by setting the launchContext.useIncognitoPages crawler constructor option to true. Note that this tells Crawlee to use a new browser instance for each request, so it can worsen the performance of your crawlers. The actual impact depends on your use case, though.

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    launchContext: {
        useIncognitoPages: true, // Use one browser per request, fixes the session pairing issues
    },
    requestHandler: async ({ enqueueLinks, session, proxyInfo }) => {
        // ...
    },
});
