Ingestion Speedup Multiple strategy #1309

Merged
merged 14 commits into from
Nov 25, 2023


lopagela
Contributor

Multiple strategies for file ingestion.

This is ready to be merged as is.
A follow-up PR will add a configuration switch to select between the implementations.
Documentation should be added as well.
Ideally, the most optimized implementation should also be debugged in that follow-up PR.


The current behavior is preserved, so existing users are not affected.

Multiprocessing map is used to perform the CPU-bound operations on the files that have to be ingested.
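
As an illustration of the approach described in this commit message, here is a minimal sketch of mapping a CPU-bound parsing step over a list of files with `multiprocessing.Pool.map`. The function names `parse_file` and `parse_files_in_parallel`, and the word-counting work itself, are assumptions made up for this example; they are not taken from the PR's code.

```python
from __future__ import annotations

import multiprocessing
from pathlib import Path


def parse_file(path: Path) -> tuple[str, int]:
    """Hypothetical CPU-bound step: read one file and count its words."""
    text = path.read_text(encoding="utf-8", errors="ignore")
    return path.name, len(text.split())


def parse_files_in_parallel(
    paths: list[Path], workers: int | None = None
) -> list[tuple[str, int]]:
    """Run parse_file over every path using a pool of worker processes."""
    # Pool.map distributes the paths across the workers and returns the
    # results in the same order as the input list.
    with multiprocessing.Pool(processes=workers) as pool:
        return pool.map(parse_file, paths)
```

When used from a script, `parse_files_in_parallel` should be called under an `if __name__ == "__main__":` guard, since worker processes may re-import the main module on some platforms.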

TODO: make the ingestion script use the bulk method.
This way, we do not have to use the `global` keyword.
Also moved the argument parsing and the initialization of the dependency context as late as possible, in order to fail fast and return the help faster.
The profile `ingest-local` does not load the LLM, but it does load the embedding model.
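
One possible reading of the fail-fast change, sketched under assumptions: arguments are parsed inside `main`, and the expensive setup only runs once parsing has succeeded, so `--help` or an invalid invocation returns without loading anything. The program name, the `folder` argument, and `load_heavy_dependencies` are hypothetical stand-ins, not the actual ingestion script.

```python
from __future__ import annotations

import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build the CLI parser (program name and argument are hypothetical)."""
    parser = argparse.ArgumentParser(prog="ingest_folder")
    parser.add_argument("folder", help="Folder to ingest")
    return parser


def load_heavy_dependencies() -> dict[str, str]:
    """Stand-in for expensive startup work such as loading an embedding model."""
    return {"embedding_model": "loaded"}


def main(argv: list[str] | None = None) -> int:
    # parse_args exits immediately on --help or on invalid arguments,
    # before any expensive initialization has started.
    args = build_parser().parse_args(argv)
    # Only reached when the arguments are valid.
    deps = load_heavy_dependencies()
    print(f"Ingesting {args.folder} (embedding model: {deps['embedding_model']})")
    return 0
```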
…he API

Create a shell (IngestionHelper) to hold the file to documents conversion, as well as the metadata modification.

Also simplified the code by simplifying the typing and the potential hack needed when `file_data` contains the actual data instead of a path to the file.
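
To make the idea of a helper "shell" more concrete, the following is a rough sketch of a class that groups the file-to-documents conversion with the metadata modification. Everything here is an assumption for illustration: the `Document` dataclass, the class name `IngestionHelperSketch`, its method names, and the paragraph-splitting logic do not come from the PR, whose actual `IngestionHelper` may look quite different.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class Document:
    """Hypothetical stand-in for the project's document type."""

    text: str
    metadata: dict[str, str] = field(default_factory=dict)


class IngestionHelperSketch:
    """Groups file-to-documents conversion and metadata tweaks in one place."""

    @staticmethod
    def file_to_documents(file_name: str, file_path: Path) -> list[Document]:
        text = file_path.read_text(encoding="utf-8", errors="ignore")
        # Assumption: one document per non-empty paragraph.
        chunks = [chunk.strip() for chunk in text.split("\n\n")]
        documents = [Document(text=chunk) for chunk in chunks if chunk]
        IngestionHelperSketch.tag_documents(documents, file_name)
        return documents

    @staticmethod
    def tag_documents(documents: list[Document], file_name: str) -> None:
        # Metadata is modified here once, so callers need not touch it again.
        for document in documents:
            document.metadata["file_name"] = file_name
```

Keeping both steps as static methods means the conversion needs no shared state, which would make it straightforward to call from worker processes or threads.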
…nent

Exposing different ways to parallelize file ingestion.
Collaborator

@imartinez imartinez left a comment

Great work! Small comments, mostly some clean up.

private_gpt/components/ingest/ingest_component.py (outdated, resolved)
private_gpt/components/ingest/ingest_component.py (outdated, resolved)
private_gpt/components/ingest/ingest_component.py (outdated, resolved)
private_gpt/server/ingest/ingest_service.py (outdated, resolved)
settings-ingest-local.yaml (outdated, resolved)
@imartinez imartinez merged commit bafdd3b into zylon-ai:main Nov 25, 2023
6 checks passed
Do not re-delete the metadata of the files (they were already tweaked in the IngestionHelper class).
@lopagela lopagela deleted the ingestion/speedup-thread branch November 27, 2023 10:44
@lopagela
Contributor Author

lopagela commented Dec 1, 2023

Follow-up PR that fixes the parallel ingestion: #1336

simonbermudez pushed a commit to simonbermudez/saimon that referenced this pull request Feb 24, 2024