Add TransformHeadersAgent #26

Merged
merged 12 commits into from
Aug 4, 2021
Changes from 11 commits
2 changes: 2 additions & 0 deletions README.md
@@ -38,6 +38,8 @@ This package can only generate all the standard attributes. There still might be
The second colossal factor is using the same HTTP version as browsers. Most modern browsers use HTTP v2, so using it when the server supports it takes you one more step further to look like a browser. Luckily you don't have to care about the server HTTP version support since `got` automatically handles [ALPN protocol negotiation](https://en.wikipedia.org/wiki/Application-Layer_Protocol_Negotiation) and uses the HTTP v2 only if it is supported.
The last step is to have a browser-like TLS suite and ciphers. According to our research, the cipher `TLS_AES_256_GCM_SHA384` is used among all modern browsers. We use this cipher as a default one. However, feel free to change it.

HTTP/1.1 headers are always automatically formatted in [`Pascal-Case`](https://pl.wikipedia.org/wiki/PascalCase). There is an exception: [`x-`](https://datatracker.ietf.org/doc/html/rfc7231#section-8.3.1) headers are not modified in *any* way.
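
As a rough illustration of the casing rule described above, here is a minimal standalone sketch. `formatHeaderName` is a hypothetical helper name used only for this example, and the sketch assumes that a header is left untouched exactly when its name starts with `x-`.

```javascript
// Hypothetical helper illustrating the casing rule; not part of the package API.
// Assumption: only names starting with `x-` (case-insensitive) are left as-is.
function formatHeaderName(name) {
    if (name.toLowerCase().startsWith('x-')) {
        // `x-` headers are passed through without any modification.
        return name;
    }

    // Capitalize the first letter of each dash-separated part, lowercase the rest.
    return name
        .split('-')
        .map((part) => part.charAt(0).toUpperCase() + part.slice(1).toLowerCase())
        .join('-');
}

console.log(formatHeaderName('user-agent')); // User-Agent
console.log(formatHeaderName('ACCEPT-ENCODING')); // Accept-Encoding
console.log(formatHeaderName('x-my-HEADER')); // x-my-HEADER
```

Note that this formatting also changes acronym-style names, so `dnt` would become `Dnt` rather than `DNT`.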

This package should provide a solid start for your browser request emulation process. All websites are built differently, and some of them might require some additional special care.

## Proxies
1 change: 1 addition & 0 deletions package.json
@@ -23,6 +23,7 @@
"eslint": "^7.0.0",
"express": "^4.17.1",
"fs-extra": "^9.1.0",
"get-stream": "^5.2.0",
"jest": "^26.6.3",
"jest-extended": "^0.11.5",
"jsdoc-to-markdown": "^7.0.0",
131 changes: 131 additions & 0 deletions src/agent/transform-headers-agent.js
@@ -0,0 +1,131 @@
/* eslint-disable no-underscore-dangle */
const http = require('http');
const WrappedAgent = require('./wrapped-agent');

const { _storeHeader } = http.OutgoingMessage.prototype;

/**
* @description Transforms the casing of the headers to Pascal-Case.
*/
class TransformHeadersAgent extends WrappedAgent {
// Rewritten from https://github.com/nodejs/node/blob/533cafcf7e3ab72e98a2478bc69aedfdf06d3a5e/lib/_http_outgoing.js#L442-L479
/**
* @description Transforms the request via header normalization.
* @see {TransformHeadersAgent.toPascalCase}
* @param {http.ClientRequest} request
* @param {string[]} [sortedHeaders] - optional list of header names in the desired order
*/
transformRequest(request, sortedHeaders) {
const headers = {};
const hasConnection = request.hasHeader('connection');
const hasContentLength = request.hasHeader('content-length');
const hasTransferEncoding = request.hasHeader('transfer-encoding');
const hasTrailer = request.hasHeader('trailer');
const keys = request.getHeaderNames();

for (const key of keys) {
if (key.toLowerCase().startsWith('x-')) {
headers[key] = request.getHeader(key);
} else {
headers[this.toPascalCase(key)] = request.getHeader(key);
}
Comment on lines +27 to +31

Member

Oh, sorry for the confusion. The `x-something` was just an example. The header could be `my-header` as well, or something else arbitrary. We've seen quite a few.

Also, I think there are some `X-` headers that are actually sent by browsers, like `X-Requested-With`.

So determining which headers are "custom" is a bit more complex.

Contributor Author

Lemme patch this real quick

Contributor Author

If the generator returns the headers in the correct casing, then it should be no problem, I think?

Member

Yeah, the generator should always return the correct casing. Unless we encounter some crap UA like the applebot.

Member

And yeah, I might be wrong about `X-Requested-With`. I've seen it multiple times, so I thought it's sent by the browser, but maybe it's just common among developers to include it.

Contributor Author

Then this PR is good to go, I think. I'm planning to optimize HTTP/2-related stuff next, such as the ALPN negotiation. I think there's no need to manually store the cache anymore.

if (sortedHeaders) {
// Removal is required in order to change the order of the properties
request.removeHeader(key);
}
}

if (!hasConnection) {
const shouldSendKeepAlive = request.shouldKeepAlive && (hasContentLength || request.useChunkedEncodingByDefault || request.agent);
if (shouldSendKeepAlive) {
headers.Connection = 'keep-alive';
} else {
headers.Connection = 'close';
}
}

if (!hasContentLength && !hasTransferEncoding) {
// Note: This uses private `_removedContLen` property.
// This property tells us whether the content-length was explicitly removed or not.
//
// Note: This uses private `_removedTE` property.
// This property tells us whether the transfer-encoding was explicitly removed or not.
if (!hasTrailer && !request._removedContLen && typeof request._contentLength === 'number') {
headers['Content-Length'] = request._contentLength;
} else if (!request._removedTE) {
headers['Transfer-Encoding'] = 'chunked';
}
}

const normalizedKeys = Object.keys(headers);
const sorted = sortedHeaders ? normalizedKeys.sort(this.createSort(sortedHeaders)) : normalizedKeys;

for (const key of sorted) {
request.setHeader(key, headers[key]);
}
}

addRequest(request, options) {
// See https://github.com/nodejs/node/blob/533cafcf7e3ab72e98a2478bc69aedfdf06d3a5e/lib/_http_outgoing.js#L373
// Note: This overrides the private `_storeHeader`.
// This is required because the original function copies
// the `connection`, `content-length` and `transfer-encoding` headers
// directly to the underlying buffer.
request._storeHeader = (...args) => {
this.transformRequest(request, options.sortedHeaders);

return _storeHeader.call(request, ...args);
};

return super.addRequest(request, options);
}

/**
* @param {string} header - header with unknown casing
* @returns {string} - header in Pascal-Case
*/
toPascalCase(header) {
return header.split('-').map((part) => {
return part[0].toUpperCase() + part.slice(1).toLowerCase();
}).join('-');
}

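/**
* Comparator for `Array.prototype.sort` that orders header names by their position in `sortedHeaders`.
* Headers that are not listed in `sortedHeaders` are placed last.
* @see https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/sort
* @param {string} a - header a
* @param {string} b - header b
* @param {string[]} sortedHeaders - array of headers in order
* @returns {number} - negative if `a` should come first, positive if `b` should come first, 0 otherwise
*/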
sort(a, b, sortedHeaders) {
const rawA = sortedHeaders.indexOf(a);
const rawB = sortedHeaders.indexOf(b);
const indexA = rawA === -1 ? Number.POSITIVE_INFINITY : rawA;
const indexB = rawB === -1 ? Number.POSITIVE_INFINITY : rawB;

if (indexA < indexB) {
return -1;
}

if (indexA > indexB) {
return 1;
}

return 0;
}

/**
*
* @param {string[]} sortedHeaders - array of headers in order
* @returns {Function} - sort function
*/
createSort(sortedHeaders) {
const sortWithSortedHeaders = (a, b) => this.sort(a, b, sortedHeaders);

return sortWithSortedHeaders;
}
}

module.exports = TransformHeadersAgent;
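
The ordering logic in `sort` and `createSort` can be tried on its own. The following is a minimal standalone sketch of the same idea, not the code from this PR: `byPreferredOrder` and the sample header lists are hypothetical names chosen for illustration.

```javascript
// Hypothetical standalone comparator factory mirroring the idea of `createSort`.
// Names missing from `preferred` get rank Infinity, so they sort last.
const byPreferredOrder = (preferred) => (a, b) => {
    const rank = (name) => {
        const index = preferred.indexOf(name);
        return index === -1 ? Number.POSITIVE_INFINITY : index;
    };

    const rankA = rank(a);
    const rankB = rank(b);

    // Explicit comparisons instead of `rankA - rankB`,
    // because Infinity - Infinity is NaN.
    if (rankA < rankB) return -1;
    if (rankA > rankB) return 1;
    return 0;
};

const preferred = ['Host', 'Connection', 'User-Agent', 'Accept'];
const names = ['Accept', 'X-Custom', 'User-Agent', 'Host'];
const sorted = [...names].sort(byPreferredOrder(preferred));

console.log(sorted); // [ 'Host', 'User-Agent', 'Accept', 'X-Custom' ]
```

Because the lookup uses `indexOf`, it is case-sensitive: in this sketch the preferred list has to use the same casing as the names being sorted.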
42 changes: 42 additions & 0 deletions src/agent/wrapped-agent.js
@@ -0,0 +1,42 @@
/**
* @see https://github.com/nodejs/node/blob/533cafcf7e3ab72e98a2478bc69aedfdf06d3a5e/lib/_http_client.js#L129-L162
* @see https://github.com/nodejs/node/blob/533cafcf7e3ab72e98a2478bc69aedfdf06d3a5e/lib/_http_client.js#L234-L246
* @see https://github.com/nodejs/node/blob/533cafcf7e3ab72e98a2478bc69aedfdf06d3a5e/lib/_http_client.js#L304-L305
* @description Wraps an existing Agent instance,
* so there's no need to replace `agent.addRequest`.
*/
class WrappedAgent {
constructor(agent) {
this.agent = agent;
}

addRequest(request, options) {
return this.agent.addRequest(request, options);
}

get keepAlive() {
return this.agent.keepAlive;
}

get maxSockets() {
return this.agent.maxSockets;
}

get options() {
return this.agent.options;
}

get defaultPort() {
return this.agent.defaultPort;
}

get protocol() {
return this.agent.protocol;
}

destroy() {
this.agent.destroy();
}
}

module.exports = WrappedAgent;
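
To show how this delegation pattern can be extended, here is a minimal sketch under stated assumptions. It re-declares a reduced `WrappedAgent` so that it runs on its own; `CountingAgent` and `fakeAgent` are hypothetical names, and `fakeAgent` is only a stand-in for a real `http.Agent`.

```javascript
// Reduced copy of the wrapper, limited to what this sketch needs.
class WrappedAgent {
    constructor(agent) {
        this.agent = agent;
    }

    addRequest(request, options) {
        return this.agent.addRequest(request, options);
    }
}

// Hypothetical subclass: does some work, then delegates to the wrapped agent.
class CountingAgent extends WrappedAgent {
    constructor(agent) {
        super(agent);
        this.count = 0;
    }

    addRequest(request, options) {
        this.count += 1;
        return super.addRequest(request, options);
    }
}

// Stand-in for a real agent; it only records the requests it receives.
const fakeAgent = {
    seen: [],
    addRequest(request) {
        this.seen.push(request);
    },
};

const agent = new CountingAgent(fakeAgent);
agent.addRequest({ path: '/a' }, {});
agent.addRequest({ path: '/b' }, {});

console.log(agent.count, fakeAgent.seen.length); // 2 2
```

The wrapped agent itself is never modified, which is the point of the pattern: the subclass intercepts `addRequest` without replacing the method on an instance that other code may share.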
58 changes: 32 additions & 26 deletions src/hooks/proxy.js
@@ -2,6 +2,7 @@ const http2 = require('http2-wrapper');
const HttpsProxyAgent = require('https-proxy-agent');
const HttpProxyAgent = require('http-proxy-agent');
const httpResolver = require('../http-resolver');
const TransformHeadersAgent = require('../agent/transform-headers-agent');

const {
HttpOverHttp2,
@@ -25,27 +26,32 @@ exports.proxyHook = async function (options) {
const parsedProxy = new URL(proxyUrl);

validateProxyProtocol(parsedProxy.protocol);
const agents = await getAgents(parsedProxy, options.https.rejectUnauthorized);
options.agent = await getAgents(parsedProxy, options.https.rejectUnauthorized);
Member

I'm a bit torn on whether this improves readability. I know it's shorter this way, but it made more sense to me when both options were next to each other and not separated by the big comment. I don't have a strong opinion about this, but I would like to learn why you think it's better this way.

Contributor Author

It's required. Otherwise it would fail if the user provided their own agents.

Contributor Author

It's because `http2.request` is used here.

Member

Maybe I'm blind, but I don't see any difference in behavior between:

    const agents = await getAgents(parsedProxy, options.https.rejectUnauthorized);
    if (resolvedRequestProtocol === 'http2') {
        options.agent = agents[resolvedRequestProtocol];
    } else {
        options.agent = agents;
    }

and

    options.agent = await getAgents(parsedProxy, options.https.rejectUnauthorized);
    if (resolvedRequestProtocol === 'http2') {
        options.agent = options.agent[resolvedRequestProtocol];
    }

🤷‍♂️

Contributor Author

The `if (resolvedRequestProtocol === 'http2') {` is now outside `if (proxyUrl) {`; previously it was inside. You're not blind, it just may not be obvious at first sight. Let me add appropriate comments in this regard :)

Contributor Author

    * The `if` below cannot be placed inside the `if` above.
    * Otherwise `http2.request` would receive the entire `agent` object
    * __when not using a proxy__.
    * ---
    * `http2.request`, unlike `http2.auto`, expects an instance of `http2.Agent`.
    * `http2.auto` expects an object with `http`, `https` and `http2` properties.

Unless you mean something else 🤔

}

/**
* This is needed because got expects all three agents in an object like this:
* {
* http: httpAgent,
* https: httpsAgent,
* http2: http2Agent,
* }
* The confusing thing is that internally, it destructures the agents out of
* the object for HTTP and HTTPS, but keeps the structure for HTTP2,
* because it passes all the agents down to http2.auto (from http2-wrapper).
* We're not using http2.auto, but http2.request, which expects a single agent.
* So for HTTP2, we need a single agent and for HTTP and HTTPS we need the object
* to allow destructuring of correct agents.
*/
if (resolvedRequestProtocol === 'http2') {
options.agent = agents[resolvedRequestProtocol];
} else {
options.agent = agents;
}
/**
* This is needed because got expects all three agents in an object like this:
* {
* http: httpAgent,
* https: httpsAgent,
* http2: http2Agent,
* }
* The confusing thing is that internally, it destructures the agents out of
* the object for HTTP and HTTPS, but keeps the structure for HTTP2,
* because it passes all the agents down to http2.auto (from http2-wrapper).
* We're not using http2.auto, but http2.request, which expects a single agent.
* So for HTTP2, we need a single agent and for HTTP and HTTPS we need the object
* to allow destructuring of correct agents.
* ---
* The `if` below cannot be placed inside the `if` above.
* Otherwise `http2.request` would receive the entire `agent` object
* __when not using a proxy__.
* ---
* `http2.request`, unlike `http2.auto`, expects an instance of `http2.Agent`.
* `http2.auto` expects an object with `http`, `https` and `http2` properties.
*/
if (resolvedRequestProtocol === 'http2') {
options.agent = options.agent[resolvedRequestProtocol];
}
};
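
The agent-shape rule from the comment above can be summarized in a small standalone sketch. `pickAgent` is a hypothetical helper that does not exist in this PR, and the agent values are plain placeholder objects rather than real agents.

```javascript
// Hypothetical helper mirroring the rule described in the comment above.
function pickAgent(agents, resolvedRequestProtocol) {
    if (resolvedRequestProtocol === 'http2') {
        // `http2.request` expects a single agent, so unwrap it.
        return agents.http2;
    }

    // For HTTP and HTTPS the whole object is kept,
    // so the matching agent can be destructured later.
    return agents;
}

// Placeholder objects standing in for real agent instances.
const agents = {
    http: { name: 'http-agent' },
    https: { name: 'https-agent' },
    http2: { name: 'http2-agent' },
};

console.log(pickAgent(agents, 'http2').name); // http2-agent
console.log(Object.keys(pickAgent(agents, 'https'))); // [ 'http', 'https', 'http2' ]
```

Running the check regardless of whether a proxy is configured matches the reasoning in the review thread: the unwrapping has to happen for every HTTP/2 request, not only for proxied ones.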

@@ -83,21 +89,21 @@ async function getAgents(parsedProxyUrl, rejectUnauthorized) {

if (proxyIsHttp2) {
agent = {
http: new HttpOverHttp2(proxy),
https: new HttpsOverHttp2(proxy),
http: new TransformHeadersAgent(new HttpOverHttp2(proxy)),
https: new TransformHeadersAgent(new HttpsOverHttp2(proxy)),
http2: new Http2OverHttp2(proxy),
};
} else {
agent = {
http: new HttpsProxyAgent(proxyUrl.href),
https: new HttpsProxyAgent(proxyUrl.href),
http: new TransformHeadersAgent(new HttpsProxyAgent(proxyUrl.href)),
https: new TransformHeadersAgent(new HttpsProxyAgent(proxyUrl.href)),
http2: new Http2OverHttps(proxy),
};
}
} else {
agent = {
http: new HttpProxyAgent(proxyUrl.href),
https: new HttpsProxyAgent(proxyUrl.href),
http: new TransformHeadersAgent(new HttpProxyAgent(proxyUrl.href)),
https: new TransformHeadersAgent(new HttpsProxyAgent(proxyUrl.href)),
http2: new Http2OverHttp(proxy),
};
}
8 changes: 8 additions & 0 deletions src/index.js
@@ -1,6 +1,10 @@
const http = require('http');
const https = require('https');

const got = require('got');
const HeaderGenerator = require('header-generator');

const TransformHeadersAgent = require('./agent/transform-headers-agent');
const { SCRAPING_DEFAULT_OPTIONS } = require('./scraping-defaults');

const { optionsValidationHandler } = require('./hooks/options-validation');
@@ -17,6 +21,10 @@ const gotScraping = got.extend({
context: {
headerGenerator: new HeaderGenerator(),
},
agent: {
http: new TransformHeadersAgent(http.globalAgent),
https: new TransformHeadersAgent(https.globalAgent),
},
hooks: {
init: [
(opts) => optionsValidationHandler(opts, () => {}),