feat: Sitemap-based request list implementation #2498
Pull request is neither linked to an issue or epic nor labeled as adhoc!
1a8014f to e16a76d
A few initial comments.
@@ -444,7 +444,7 @@ export class BasicCrawler<Context extends CrawlingContext = BasicCrawlingContext
 * A reference to the underlying {@apilink RequestList} class that manages the crawler's {@apilink Request|requests}.
 * Only available if used by the crawler.
 */
requestList?: RequestList;
Technically, this is breaking for anybody who extends BasicCrawler and accesses this. But anyone who does that should be able to deal with the change IMO.
@B4nan is this cool?
I guess it is fine. By breaking, you mean only if they depend on a specific API that is not part of the new interface, right?
Exactly. It seems improbable to me that anyone would do that, but life is full of surprises.
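To make the concern in this thread concrete, here is a minimal sketch of how widening a property from a concrete class to an interface can break a subclass. All names (`RequestListLike`, `ConcreteList`, `SketchCrawler`, `MyCrawler`) are hypothetical and are not Crawlee's actual definitions; the assumption is only that the new property type exposes fewer members than the concrete class did.

```typescript
// Hypothetical narrow interface standing in for "the new interface".
interface RequestListLike {
    isEmpty(): boolean;
    fetchNextRequest(): string | null;
}

// Hypothetical concrete list that offers one method beyond the interface.
class ConcreteList implements RequestListLike {
    private index = 0;

    constructor(private readonly urls: string[]) {}

    isEmpty(): boolean {
        return this.index >= this.urls.length;
    }

    fetchNextRequest(): string | null {
        if (this.isEmpty()) return null;
        const url = this.urls[this.index];
        this.index += 1;
        return url ?? null;
    }

    // Not part of RequestListLike: code relying on this is what breaks.
    length(): number {
        return this.urls.length;
    }
}

// Stand-in for the crawler after the property type was widened.
class SketchCrawler {
    constructor(public requestList?: RequestListLike) {}
}

class MyCrawler extends SketchCrawler {
    totalRequests(): number {
        // `this.requestList.length()` no longer type-checks here,
        // because `length` is not declared on RequestListLike.
        // Narrowing to the concrete class is one way to adapt.
        const list = this.requestList;
        return list instanceof ConcreteList ? list.length() : 0;
    }
}
```

Subclasses that only call members declared on the interface keep compiling unchanged, which matches the expectation above that few users would be affected.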
@@ -79,52 +285,6 @@ class SitemapTxtParser extends Writable {
export class Sitemap {
This class has become pretty hollow. I'd welcome any opinions on how to make it useful. Or can we remove it?
Just one idea. Approving, as long as you can say the migrations on the platform are working as expected.
How's this looking? Anything we can help with?
Well, I want to add timeouts/cancellation. Also, we need to test if the
Alright, ready for the next round of reviews! I simplified the parsing logic quite a lot (in my eyes) - in 9e4a620 adds a new helper method b1498a8 adds
Alright, time for the (yet another) final review! My previous comment should provide enough guidance for the top-level ideas.
This introduces an alternative RequestList implementation based on sitemaps. It should be possible to use this in tandem with RequestProvider in BasicCrawler, just like with the current RequestList.
In the future, this will make it possible to start crawling before the sitemap is finished loading.
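As a rough illustration of the idea (not the implementation in this PR), the sketch below shows a request source that hands out URLs while the sitemap is still arriving in chunks. The names `RequestSource` and `SketchSitemapRequestList` are hypothetical, and the regex-based `<loc>` extraction is a deliberate simplification swapped in for a real streaming XML parser; it does not decode XML entities or follow sitemap indexes.

```typescript
// Hypothetical minimal interface a crawler could consume.
interface RequestSource {
    fetchNextUrl(): string | null;
    isFinished(): boolean;
}

class SketchSitemapRequestList implements RequestSource {
    private buffer = '';
    private readonly queue: string[] = [];
    private readonly seen = new Set<string>();
    private loaded = false;

    // Feed one chunk of sitemap XML as it arrives from the network.
    push(chunk: string): void {
        this.buffer += chunk;
        const locPattern = /<loc>\s*([^<]+?)\s*<\/loc>/g;
        let consumed = 0;
        let match: RegExpExecArray | null = locPattern.exec(this.buffer);
        while (match !== null) {
            const url = match[1];
            if (url !== undefined && !this.seen.has(url)) {
                this.seen.add(url);
                this.queue.push(url);
            }
            consumed = locPattern.lastIndex;
            match = locPattern.exec(this.buffer);
        }
        // Keep only the unparsed tail, which may hold a partial <loc> element.
        this.buffer = this.buffer.slice(consumed);
    }

    // Signal that the whole sitemap has been received.
    end(): void {
        this.loaded = true;
    }

    // Returns the next URL, or null if none is available yet.
    fetchNextUrl(): string | null {
        return this.queue.shift() ?? null;
    }

    // Finished only when loading is done and every URL was handed out.
    isFinished(): boolean {
        return this.loaded && this.queue.length === 0;
    }
}
```

In this sketch a `null` from `fetchNextUrl()` while `isFinished()` is still `false` means "nothing parsed yet, try again later", which is the distinction a crawler needs in order to start working before loading completes.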
TODO