WebScraper refactor into scrapeURL #714

Draft · wants to merge 19 commits into main
Conversation

@mogery (Member) commented Sep 28, 2024

Directives:

  • reduce state
    • stateless, functional programming paradigms to reduce debugging complexity
      • state that is required (e.g. current logger) is passed in an immutable meta object
    • sovereign modules that do not see the whole state (see e.g. engines/fire-engine/scrape.ts, it only has logger, not the whole meta object)
  • make the signal flow clear to ease debugging
    • intense verbosity in logging
    • modularity, make it clear where to add things in the future, make it easy to add things in the future without breaking stuff
      • define generic modules that can be implemented and appended to later (e.g. transformers, engines)
  • better error handling
  • using a Rust-like error model: exceptions are freely thrown instead of being wrapped into {success: false, error: ...} objects
    • errors that occur are always re-thrown with the original metadata (e.g. stack) intact. when doing things like bounded retries, previous errors are passed along using the cause property.
  • we must never swallow an error.
    • at points where errors are not directly re-thrown but instead collected into an object/array (e.g. retry logic/EngineResultsTracker), unexpected errors should be explicitly logged and Sentry.captureException'd
  • errors are only transformed into a {success: false} result object at the top level of scrapeURL, in order to avoid breaking other parts of the codebase; the error itself is passed along in the result's error metadata
  • never determine what an error is by checking its message -- if a specific error needs to be identifiable by other parts of the codebase, create a custom error class and use instanceof -- see error.ts for reference
  • standalone
  • scrapeURL should never (even attempt to!) interface with the database. It should be a standalone unit that could be lifted out of firecrawl entirely. To keep it fast, reliable, and maintainable, we need to keep its footprint minimal -- DB code can be handled by the surrounding parts that are already tangled up in it anyway (e.g. queue-worker.ts)
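As a rough illustration of the "reduce state" directive, here is a minimal TypeScript sketch of a frozen meta object and a sovereign engine module that only receives the logger. All names (Meta, buildMeta, engineScrape) are hypothetical, not the actual firecrawl code:

```typescript
// Hypothetical sketch -- illustrative names, not the real scrapeURL API.

type Logger = {
  info: (msg: string) => void;
};

// All state a scrape needs, built once and frozen; downstream modules
// receive only the slice they need instead of the whole object.
type Meta = Readonly<{
  url: string;
  id: string;
  logger: Logger;
}>;

function buildMeta(url: string): Meta {
  const id = Math.random().toString(36).slice(2);
  const logger: Logger = {
    info: (msg) => console.log(`[scrape:${id}] ${msg}`),
  };
  return Object.freeze({ url, id, logger });
}

// A "sovereign" engine module: it sees only the logger, never the full meta
// (mirroring how engines/fire-engine/scrape.ts is described above).
function engineScrape(url: string, logger: Logger): string {
  logger.info(`scraping ${url}`);
  return `<html>content of ${url}</html>`;
}

function scrapeURL(url: string): string {
  const meta = buildMeta(url);
  meta.logger.info("starting scrape");
  return engineScrape(meta.url, meta.logger); // pass only what is needed
}
```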

@nickscamara (Member) commented:
  • Add sb
  • Integrate w/ v1
  • Make crawl not crash if scrapeURL throws
