feat: Sitemap-based request list implementation #2498

Merged
merged 27 commits into from
Jul 4, 2024
Changes from 6 commits
Commits
fbe7ee9
refactor: Extract IRequestList interface from the RequestList class
janbuchar May 24, 2024
f20d14a
Implement a basic SitemapRequestList
janbuchar May 24, 2024
e16a76d
Lint
janbuchar May 24, 2024
27f6dd3
Refactor sitemap utils to allow processing in the background
janbuchar May 28, 2024
610e6b6
Handle all sitemap.xml fields
janbuchar May 28, 2024
d8951ef
Implement sitemap loading in the background
janbuchar May 29, 2024
09b3bb1
inheritdoc -> inheritDoc
janbuchar May 29, 2024
f45e957
Test sitemap metadata extraction
janbuchar May 29, 2024
13d7b62
Basic SitemapRequestList functionality tests
janbuchar May 29, 2024
a695f91
Fill in JSDoc
janbuchar May 29, 2024
a6fd628
Implement persistence + better queueing
janbuchar May 31, 2024
a4d77b7
Lint
janbuchar May 31, 2024
3baaa7b
chore: simplify `RequestList.load()` async syntax
barjin Jun 25, 2024
ff97283
chore: more fitting variable naming
barjin Jun 25, 2024
ec437fd
chore: get rid of `requestData` for the sake of simplicity
barjin Jun 25, 2024
1462b32
feat: more granular sitemap parsing persistence, simplify code
barjin Jun 26, 2024
9e4a620
feat: add `waitForNextRequest` helper for `IRequestList`
barjin Jun 26, 2024
b1498a8
feat: add `signal` and `timeoutMillis` options for `SitemapRequestList`
barjin Jun 26, 2024
31efd21
feat: remove the old `queuedUrlsBySitemap` property
barjin Jun 27, 2024
7e71afe
chore: improve comment wording
barjin Jun 27, 2024
99a12a6
chore: naming, documentation
barjin Jun 27, 2024
53bee4b
feat: minimizing the memory footprint with buffering streams
barjin Jun 27, 2024
7965373
feat: customizable `bufferSize`, `Request` changes persistence
barjin Jul 3, 2024
c53dba2
Merge branch 'master' into sitemap-request-list
barjin Jul 3, 2024
4b62f08
fix: forever waiting in RequestList.requestIterator
barjin Jul 3, 2024
a796104
fix: align the tests with the current interface
barjin Jul 3, 2024
ead067b
feat: swap `requestIterator` for `[Symbol.asyncIterator]`
barjin Jul 4, 2024
20 changes: 20 additions & 0 deletions packages/core/src/storages/request_list.ts
@@ -65,6 +65,17 @@ export interface IRequestList {
     */
    fetchNextRequest(): Promise<Request | null>;

    /**
     * Gets the next {@apilink Request} to process. First, the function gets a request previously reclaimed
     * using the {@apilink RequestList.reclaimRequest} function, if there is any.
     * Otherwise it gets the next request from sources.
     *
     * The function resolves to `null` if there are no more requests to process.
     *
     * Unlike `fetchNextRequest()`, this function returns an `AsyncGenerator` that can be used in a `for await...of` loop.
     */
    waitForNextRequest(): AsyncGenerator<Request | null>;
barjin marked this conversation as resolved.

    /**
     * Reclaims request to the list if its processing failed.
     * The request will become available in the next `this.fetchNextRequest()`.
@@ -672,6 +683,15 @@ export class RequestList implements IRequestList {
        return null;
    }

    /**
     * @inheritDoc
     */
    async *waitForNextRequest() {
        while (true) {
            yield await this.fetchNextRequest();
        }
    }

    private ensureRequest(requestLike: Request | RequestOptions, index: number): Request {
        if (requestLike instanceof Request) {
            return requestLike;
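
The JSDoc above says `waitForNextRequest()` can be consumed with `for await...of`. A minimal sketch of what that could look like from the caller's side follows. `SketchRequest`, `InMemoryList`, and `drain` are hypothetical names invented for illustration and are not part of this PR or of Crawlee; only the two method names and the `while (true)` generator body are taken from the diff. Because the generator as written at this commit never returns on its own, the sketch assumes the consumer treats the first `null` as the end of the list.

```typescript
// Hypothetical stand-in for Crawlee's Request; only `url` is modeled here.
interface SketchRequest {
    url: string;
}

// Hypothetical in-memory list mirroring the two methods shown in the diff.
class InMemoryList {
    private nextIndex = 0;
    private readonly urls: string[];

    constructor(urls: string[]) {
        this.urls = urls;
    }

    // Resolves to the next request, or `null` once every URL has been handed out.
    async fetchNextRequest(): Promise<SketchRequest | null> {
        const url = this.urls[this.nextIndex];
        if (url === undefined) return null;
        this.nextIndex += 1;
        return { url };
    }

    // Same shape as the generator in the diff: it keeps yielding `null`
    // after the last request instead of finishing.
    async *waitForNextRequest(): AsyncGenerator<SketchRequest | null> {
        while (true) {
            yield await this.fetchNextRequest();
        }
    }
}

// Collects URLs until the first `null`, which this consumer treats as end-of-list.
async function drain(list: InMemoryList): Promise<string[]> {
    const seen: string[] = [];
    for await (const request of list.waitForNextRequest()) {
        if (request === null) break; // without this check the loop would not terminate
        seen.push(request.url);
    }
    return seen;
}
```

Calling `drain(new InMemoryList([...]))` resolves to the URLs in insertion order. The explicit `null` check matters only for the generator shape at this point in the PR; a later commit in the list above (`fix: forever waiting in RequestList.requestIterator`) suggests the termination behavior was revisited.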