Ingestion Speedup Multiple strategy #1309

lopagela · 2023-11-25T13:08:12Z

Multiple strategies for file ingestion.

This is good to be pushed as is.
A follow-up PR will add a configuration switch to choose between the implementations.
Documentation should be added as well.
Ideally, the most efficient implementation should also be debugged in that follow-up PR.

The current behavior is preserved, so existing users are not impacted.

A multiprocessing map is used to perform the CPU-bound operations on the files to be ingested. TODO: make the ingestion script use the bulk method.
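A minimal sketch of the multiprocessing map described above, assuming the CPU-bound step is a pure function of a single file path. `transform_file` and `bulk_transform` are hypothetical names chosen for illustration; they are not the functions added in this PR.

```python
from multiprocessing import Pool
from pathlib import Path


def transform_file(path: Path) -> list[str]:
    """Hypothetical CPU-bound step: split one file into non-empty chunks."""
    text = path.read_text(encoding="utf-8")
    return [chunk.strip() for chunk in text.split("\n\n") if chunk.strip()]


def bulk_transform(paths: list[Path], workers: int = 2) -> list[list[str]]:
    """Transform many files; returns one result list per path, in input order."""
    if workers <= 1 or len(paths) <= 1:
        # Not worth paying for process start-up; run in the current process.
        return [transform_file(p) for p in paths]
    # Pool.map pickles each path, runs transform_file in a worker process,
    # and returns the results in the same order as the inputs.
    with Pool(processes=workers) as pool:
        return pool.map(transform_file, paths)
```

On platforms that start workers with `spawn` (Windows, macOS), the call to `bulk_transform` should sit under an `if __name__ == "__main__":` guard so that worker processes can import the module safely.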

The ingestion script is refactored in an object-oriented style. This way, we do not have to use the `global` keyword.
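To illustrate why an object-oriented script avoids `global`, here is a small sketch in which the bookkeeping lives on an instance instead of in module-level variables. The class name `IngestRunner` and the ignored-suffix rule are assumptions made for the example, not code from this PR.

```python
from pathlib import Path


class IngestRunner:
    """Hypothetical script object: counters live on the instance, not in globals."""

    def __init__(self, ignored_suffixes: frozenset = frozenset({".tmp"})) -> None:
        self.ignored_suffixes = ignored_suffixes
        self.ingested: list[Path] = []
        self.skipped: list[Path] = []

    def ingest(self, path: Path) -> None:
        # Mutating self replaces the `global` state a flat script would need.
        if path.suffix in self.ignored_suffixes:
            self.skipped.append(path)
        else:
            self.ingested.append(path)

    def ingest_all(self, paths: list[Path]) -> None:
        for path in paths:
            self.ingest(path)
```

Each run of the script can then create its own `IngestRunner`, which also makes the logic easier to test in isolation.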

Also moved the argument parsing and the initialization of the dependency context as late as possible, in order to fail fast and return the help output sooner.
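One way to read the fail-fast goal is sketched below: nothing expensive happens at import time, arguments are parsed inside `main`, and the costly setup runs only after parsing succeeds. The argument names and `build_context` are hypothetical stand-ins, not the script's actual interface.

```python
import argparse

init_log: list[str] = []  # records each time the expensive setup runs


def build_context() -> dict:
    """Hypothetical stand-in for the expensive dependency initialization."""
    init_log.append("context built")
    return {"ready": True}


def parse_args(argv: list[str]) -> argparse.Namespace:
    parser = argparse.ArgumentParser(prog="ingest_folder")
    parser.add_argument("folder", help="Folder to ingest")
    parser.add_argument("--watch", action="store_true", help="Watch for changes")
    return parser.parse_args(argv)


def main(argv: list[str]) -> int:
    # Cheap step first: --help or a bad argument exits here via SystemExit,
    # before any model or index has been loaded.
    args = parse_args(argv)
    # Expensive step last, only once the arguments are known to be valid.
    context = build_context()
    print(f"Ingesting {args.folder} (watch={args.watch}, ready={context['ready']})")
    return 0
```

With this ordering, `main(["--help"])` prints the usage text and exits without ever calling `build_context`.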

The profile `ingest-local` does not load the LLM, but it does load the embedding model.

…he API Create a shell (IngestionHelper) to hold the file-to-documents conversion, as well as the metadata modification. Also simplified the code by simplifying the typing and the workaround needed when `file_data` holds the actual data instead of a path to the file.
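The shape of such a helper might look like the sketch below. It is not the PR's `IngestionHelper`: the `Document` dataclass, the line-per-document loader, the excluded key, and the `to_path` workaround for raw data are all assumptions made so that the example is self-contained.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path
from typing import Union


@dataclass
class Document:
    """Hypothetical stand-in for the real document type."""
    text: str
    metadata: dict = field(default_factory=dict)
    excluded_embed_metadata_keys: list = field(default_factory=list)


def to_path(file_data: Union[Path, bytes, str]) -> Path:
    """One possible workaround: spill raw data to a temp file so callers get a path."""
    if isinstance(file_data, Path):
        return file_data
    raw = file_data.encode("utf-8") if isinstance(file_data, str) else file_data
    with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
        tmp.write(raw)
    return Path(tmp.name)


class IngestionHelperSketch:
    """Groups file-to-documents conversion and metadata tweaks in one place."""

    @staticmethod
    def transform_file_into_documents(file_name: str, file_data: Path) -> list[Document]:
        documents = IngestionHelperSketch._load(file_data)
        for document in documents:
            document.metadata["file_name"] = file_name
        IngestionHelperSketch._exclude_metadata(documents)
        return documents

    @staticmethod
    def _load(file_data: Path) -> list[Document]:
        # Assumption: one document per non-empty line of a UTF-8 text file.
        lines = file_data.read_text(encoding="utf-8").splitlines()
        return [Document(text=line) for line in lines if line.strip()]

    @staticmethod
    def _exclude_metadata(documents: list[Document]) -> None:
        # Keep bookkeeping keys out of what gets embedded.
        for document in documents:
            document.excluded_embed_metadata_keys = ["file_name"]
```

Because the metadata is adjusted once inside the helper, callers do not need to touch it again afterwards.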

…nent Exposing different ways of parallelization of file ingestion.
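A rough sketch of how several parallelization strategies can sit behind one interface, with a lookup that a configuration switch could drive later. The class names, the mode strings, and the thread-pool variant are illustrative assumptions and do not mirror the component classes in this PR.

```python
from abc import ABC, abstractmethod
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

Transform = Callable[[str], list]


class IngestStrategy(ABC):
    """Hypothetical common interface; each subclass picks a parallelization."""

    def __init__(self, transform: Transform) -> None:
        self.transform = transform

    @abstractmethod
    def bulk_ingest(self, files: list[str]) -> list[list]:
        """Transform every file and return the results in input order."""


class SequentialStrategy(IngestStrategy):
    def bulk_ingest(self, files: list[str]) -> list[list]:
        return [self.transform(f) for f in files]


class ThreadedStrategy(IngestStrategy):
    def __init__(self, transform: Transform, workers: int = 4) -> None:
        super().__init__(transform)
        self.workers = workers

    def bulk_ingest(self, files: list[str]) -> list[list]:
        # Executor.map preserves input order, so results match SequentialStrategy.
        with ThreadPoolExecutor(max_workers=self.workers) as pool:
            return list(pool.map(self.transform, files))


# Assumed mode names; a settings value could select one of these keys.
STRATEGIES = {"simple": SequentialStrategy, "threaded": ThreadedStrategy}


def get_strategy(mode: str, transform: Transform) -> IngestStrategy:
    return STRATEGIES[mode](transform)
```

Keeping the return contract identical across strategies means callers can switch modes without any other code change.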

imartinez

Great work! Small comments, mostly some clean up.

private_gpt/components/ingest/ingest_component.py

private_gpt/server/ingest/ingest_service.py

settings-ingest-local.yaml

Do not re-delete the metadata of the files (they were already tweaked in the IngestionHelper class).

lopagela · 2023-12-01T13:51:31Z

Follow-up PR that fixes the parallel ingestion: #1336

lopagela added 11 commits November 21, 2023 10:19

Refactor ingestion service to keep the index reference in memory

b52f18a

Created bulk_ingest method to parallelize the document transformation

f86b518

Fix typing

81cdc9b

Refactor ingestion script in object-oriented programming

518ecb9

Refactor ingestion script to use the bulk ingestion method

5ef562c

Extract some metadata exclusion and put it in doc parsing

40f76c5

make check run

c390f10

Remove the correlation between embeddings settings and llm settings

ce05ce5

Add a profile dedicated to local ingestion and add startup logs

6484d78

Fix formatting and linting

43c7335

lopagela requested review from pabloogc and imartinez November 25, 2023 13:08

lopagela mentioned this pull request Nov 25, 2023

Speed up document ingestion #1279

Closed

Multiple strategy for file/document ingestion through ingestion compo…

94474cd

…nent Exposing different ways of parallelization of file ingestion.

imartinez requested changes Nov 25, 2023

lopagela requested a review from imartinez November 25, 2023 18:21

imartinez approved these changes Nov 25, 2023

imartinez merged commit bafdd3b into zylon-ai:main Nov 25, 2023
6 checks passed

lopagela added 2 commits November 25, 2023 20:18

PR review

a5409b4

Delete ingestion specific file that could lead to confusion

a0aea9a

lopagela deleted the ingestion/speedup-thread branch November 27, 2023 10:44

lopagela mentioned this pull request Nov 27, 2023

LLM and Embedding models interchangeability #1326

Closed

imartinez mentioned this pull request Dec 1, 2023

chore(main): release 0.1.0 #1094

Merged

simonbermudez pushed a commit to simonbermudez/saimon that referenced this pull request Feb 24, 2024

Ingestion Speedup Multiple strategy (zylon-ai#1309)

3ed7b1c
