
Use some list comprehensions, fall back from multi-chunk encoding #7

Merged · 9 commits · Apr 2, 2024

Conversation

@bensteinberg (Contributor) commented Mar 28, 2024

For your delectation, but not necessarily for inclusion -- this change populates documents, ids, and embeddings with list comprehensions rather than in a loop. (When I tried forming metadatas this way, it didn't work, though I think it could be made to.) In my limited experiments, this is somewhat more than 3x faster than before, but I bet that could depend on VECTOR_SEARCH_SENTENCE_TRANSFORMER_MODEL ("BAAI/bge-m3" in this case).

Because the original code was super slow with BAAI/bge-m3 on 5-chunk pages, I also made the change to give one chunk at a time to embedding_model.encode() -- this may not be a good idea for other embedding models or other hardware setups. (Now see the comment below about multi-chunk mode.)
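
Very roughly, the shape of the two changes is something like this (the identifiers and sample inputs here are placeholders for illustration, not the exact names in the codebase):

```python
from sentence_transformers import SentenceTransformer

embedding_model = SentenceTransformer("BAAI/bge-m3")  # model used in the experiments above
text_chunks = ["first chunk of page text", "second chunk"]  # placeholder input
document_id = "example-page"  # placeholder

# List comprehensions instead of appending inside a single loop:
documents = [chunk for chunk in text_chunks]
ids = [f"{document_id}-{i}" for i in range(len(text_chunks))]

# One chunk at a time rather than handing encode() the whole batch;
# whether this helps likely depends on the model and hardware.
embeddings = [embedding_model.encode(chunk) for chunk in text_chunks]
```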

I'm not making any real claims about the performance of list comprehensions here, btw. I started making these changes for aesthetic reasons -- I think they look nice.

@bensteinberg (Contributor, Author)

Oh, my experiments have been run with VECTOR_SEARCH_SENTENCE_TRANSFORMER_DEVICE="mps" fwiw.

@bensteinberg changed the title from "Use some list comprehensions" to "Use some list comprehensions, fall back from multi-chunk encoding" on Mar 29, 2024
@bensteinberg (Contributor, Author)

In some cases, it looks like the parallelization built into embedding_model.encode() gets stuck, so that encoding multiple chunks at once takes much longer than len(text_chunks) times the encoding time of a single chunk. This PR now includes a mechanism for tracking the timing of each encoding and comparing multi-chunk encodings with single-chunk encodings, using an arbitrary multiplier of 1.1: it tests encoding_time > len(text_chunks) * mean(one_chunk_times) * multiplier.

This does not output encoding times, but it could; it also stops tracking encoding times once it's fallen out of multi-chunk mode.
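
Schematically, the check looks something like this (variable names follow the description above and are placeholders, not necessarily the exact ones in the code):

```python
import time
from statistics import mean

# Illustrative sketch of the fallback mechanism, not the exact merged code.
# These three live outside the per-page loop; text_chunks and embedding_model
# are as in the earlier sketch.
multi_chunk_mode = True
one_chunk_times: list[float] = []
multiplier = 1.1

if multi_chunk_mode and len(text_chunks) > 1:
    start = time.time()
    embeddings = list(embedding_model.encode(text_chunks))
    encoding_time = time.time() - start

    # A batch encode that is slower than doing the chunks one at a time
    # (times the arbitrary 1.1 fudge factor) means the parallelization
    # isn't paying off, so fall back to single-chunk mode from here on.
    if one_chunk_times and encoding_time > len(text_chunks) * mean(one_chunk_times) * multiplier:
        multi_chunk_mode = False
else:
    embeddings = []
    for chunk in text_chunks:
        start = time.time()
        embeddings.append(embedding_model.encode(chunk))
        if multi_chunk_mode:  # stop tracking timings once we've left multi-chunk mode
            one_chunk_times.append(time.time() - start)
```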

@@ -40,6 +42,10 @@ def ingest() -> None:
     total_records = 0
     total_embeddings = 0

+    encoding_timings = []
+    multiplier = 1.1
Collaborator (review comment on the lines above):

Very minor thing: this variable could use a more descriptive name given its scope

@matteocargnelutti (Collaborator) left a comment

This is awesome, thank you @bensteinberg

Given that it increased in complexity, it might be worth moving the logic for generating documents, embeddings, metadatas and ids out of text_chunks into its own function / closure?

@bensteinberg (Contributor, Author)

> Given that it increased in complexity, it might be worth moving the logic for generating documents, embeddings, metadatas and ids out of text_chunks into its own function / closure?

Yeah, let me see if I can reorganize it to make it cleaner. I might want to keep the simpler parts inline and break out the embeddings.

@bensteinberg (Contributor, Author)

I've broken out the per-chunk object creation into its own function; the call is slightly busy, since it needs to pass in the embedding model, and passes the encoding timings in and out for assessing multi-chunk performance. I've also removed the multiplier from that check: ideally, the parallelization would make per-chunk encoding time much shorter than 1x, so anything even in the ballpark of 1x should trigger the switch to single-chunk mode.
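
For reference, the new helper is shaped roughly like this (its name, parameters, and annotations here are guesses for illustration, not the exact merged code):

```python
from sentence_transformers import SentenceTransformer


def process_chunks(
    text_chunks: list[str],
    metadata: dict,
    embedding_model: SentenceTransformer,
    encoding_timings: list[float],
) -> tuple[list[str], list, list[dict], list[str], list[float]]:
    """Build documents, embeddings, metadatas and ids for one page's chunks.

    The encoding timings are passed in and returned so the caller can keep
    assessing multi-chunk performance across pages.
    """
    ...  # per-chunk object creation, as described above


# The call site is what ends up slightly busy, e.g.:
# documents, embeddings, metadatas, ids, encoding_timings = process_chunks(
#     text_chunks, metadata, embedding_model, encoding_timings
# )
```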

@bensteinberg (Contributor, Author)

I'm not sure I'm doing the type annotations correctly in the new function, or whether they're necessary here.

@matteocargnelutti (Collaborator)

I have made some very minor suggestions / lints. This is awesome @bensteinberg -- thank you very much. Feel free to merge and I will release 0.1.1 with your changes.

@bensteinberg merged commit 63d7520 into harvard-lil:main on Apr 2, 2024