
Use some list comprehensions, fall back from multi-chunk encoding #7

Merged · 9 commits · Apr 2, 2024

Conversation

@bensteinberg (Contributor) commented Mar 28, 2024

For your delectation, but not necessarily for inclusion -- this change populates documents, ids, and embeddings with list comprehensions rather than in a loop. (When I tried forming metadatas this way, it didn't work, though I think it could be made to.) In my limited experiments, this is somewhat more than 3x faster than before, but I bet that could depend on VECTOR_SEARCH_SENTENCE_TRANSFORMER_MODEL ("BAAI/bge-m3" in this case).

Because the original code was super slow with BAAI/bge-m3 on 5-chunk pages, I also made the change to give one chunk at a time to embedding_model.encode() -- this may not be a good idea for other embedding models or other hardware setups. (Now see the comment below about multi-chunk mode.)
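
Very roughly, the shape of the two changes is something like this (the identifiers and sample inputs here are placeholders for illustration, not the exact names in the codebase):

```python
from sentence_transformers import SentenceTransformer

embedding_model = SentenceTransformer("BAAI/bge-m3")  # model used in the experiments above
text_chunks = ["first chunk of page text", "second chunk"]  # placeholder input
document_id = "example-page"  # placeholder

# List comprehensions instead of appending inside a single loop:
documents = [chunk for chunk in text_chunks]
ids = [f"{document_id}-{i}" for i in range(len(text_chunks))]

# One chunk at a time rather than handing encode() the whole batch;
# whether this helps likely depends on the model and hardware.
embeddings = [embedding_model.encode(chunk) for chunk in text_chunks]
```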

I'm not making any real claims about the performance of list comprehensions here, btw. I started making these changes for aesthetic reasons -- I think they look nice.

@bensteinberg (Contributor, Author)

Oh, my experiments have been run with VECTOR_SEARCH_SENTENCE_TRANSFORMER_DEVICE="mps" fwiw.

@bensteinberg changed the title from "Use some list comprehensions" to "Use some list comprehensions, fall back from multi-chunk encoding" on Mar 29, 2024
@bensteinberg (Contributor, Author)

In some cases, it looks like the parallelization built into embedding_model.encode() gets stuck, so that encoding multiple chunks at once takes much longer than len(text_chunks) times the encoding time of a single chunk. This PR now includes a mechanism for tracking the timing of each encoding and comparing multi-chunk encodings with single-chunk encodings, using an arbitrary multiplier of 1.1: it tests encoding_time > len(text_chunks) * mean(one_chunk_times) * multiplier.

This does not output encoding times, but it could; it also stops tracking encoding times once it's fallen out of multi-chunk mode.
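
Schematically, the check looks something like this (variable names follow the description above and are placeholders, not necessarily the exact ones in the code):

```python
import time
from statistics import mean

# Illustrative sketch of the fallback mechanism, not the exact merged code.
# These three live outside the per-page loop; text_chunks and embedding_model
# are as in the earlier sketch.
multi_chunk_mode = True
one_chunk_times: list[float] = []
multiplier = 1.1

if multi_chunk_mode and len(text_chunks) > 1:
    start = time.time()
    embeddings = list(embedding_model.encode(text_chunks))
    encoding_time = time.time() - start

    # A batch encode that is slower than doing the chunks one at a time
    # (times the arbitrary 1.1 fudge factor) means the parallelization
    # isn't paying off, so fall back to single-chunk mode from here on.
    if one_chunk_times and encoding_time > len(text_chunks) * mean(one_chunk_times) * multiplier:
        multi_chunk_mode = False
else:
    embeddings = []
    for chunk in text_chunks:
        start = time.time()
        embeddings.append(embedding_model.encode(chunk))
        if multi_chunk_mode:  # stop tracking timings once we've left multi-chunk mode
            one_chunk_times.append(time.time() - start)
```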

@@ -40,6 +42,10 @@ def ingest() -> None:
     total_records = 0
     total_embeddings = 0

+    encoding_timings = []
+    multiplier = 1.1
Collaborator (review comment on the lines above):

Very minor thing: this variable could use a more descriptive name given its scope

@matteocargnelutti (Collaborator) left a comment

This is awesome, thank you @bensteinberg

Given that it increased in complexity, it might be worth moving the logic for generating documents, embeddings, metadatas and ids out of text_chunks into its own function / closure?

@bensteinberg (Contributor, Author)

> Given that it increased in complexity, it might be worth moving the logic for generating documents, embeddings, metadatas and ids out of text_chunks into its own function / closure?

Yeah, let me see if I can reorganize it to make it cleaner. I might want to keep the simpler parts inline and break out the embeddings.

@bensteinberg (Contributor, Author)

I've broken out the per-chunk object creation into its own function; the call is slightly busy, since it needs to pass in the embedding model, and passes the encoding timings in and out for assessing multi-chunk performance. I've also removed the multiplier from that check: ideally, the parallelization would make per-chunk encoding time much shorter than 1x, so anything even in the ballpark of 1x should trigger the switch to single-chunk mode.
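
For reference, the new helper is shaped roughly like this (its name, parameters, and annotations here are guesses for illustration, not the exact merged code):

```python
from sentence_transformers import SentenceTransformer


def process_chunks(
    text_chunks: list[str],
    metadata: dict,
    embedding_model: SentenceTransformer,
    encoding_timings: list[float],
) -> tuple[list[str], list, list[dict], list[str], list[float]]:
    """Build documents, embeddings, metadatas and ids for one page's chunks.

    The encoding timings are passed in and returned so the caller can keep
    assessing multi-chunk performance across pages.
    """
    ...  # per-chunk object creation, as described above


# The call site is what ends up slightly busy, e.g.:
# documents, embeddings, metadatas, ids, encoding_timings = process_chunks(
#     text_chunks, metadata, embedding_model, encoding_timings
# )
```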

@bensteinberg (Contributor, Author)

I'm not sure I'm doing the type annotations correctly in the new function, or whether they're necessary here.

@matteocargnelutti (Collaborator)

I have made some very minor suggestions / lints. This is awesome @bensteinberg -- thank you very much. Feel free to merge and I will release 0.1.1 with your changes.

@bensteinberg merged commit 63d7520 into harvard-lil:main on Apr 2, 2024